Abstract

We study the theoretical properties of learning a dictionary from $N$ signals $x_i \in \mathbb{R}^K$ for $i = 1, \ldots, N$ via $\ell_1$-minimization. We assume that the $x_i$'s are i.i.d. random linear combinations of the $K$ columns from a complete (i.e., square and invertible) reference dictionary $D_0 \in \mathbb{R}^{K \times K}$. Here, the random linear coefficients are generated from either the $s$-sparse Gaussian model or the Bernoulli-Gaussian model. First, for the population case, we establish a sufficient and almost necessary condition for the reference dictionary $D_0$ to be locally identifiable, i.e., a local minimum of the expected $\ell_1$-norm objective function. Our condition covers both sparse and dense cases of the random linear coefficients and significantly improves the sufficient condition by Gribonval and Schnass (2010). In addition, we show that for a complete $\mu$-coherent reference dictionary, i.e., a dictionary with absolute pairwise column inner-product at most $\mu \in [0, 1)$, local identifiability holds even when the random linear coefficient vector has up to $O(\mu^{-2})$ nonzeros on average. Moreover, our local identifiability results also translate to the finite sample case with high probability provided that the number of signals $N$ scales as $O(K \log K)$.

Keywords: dictionary learning, $\ell_1$-minimization, local minimum, non-convex optimization, sparse decomposition.

1 Introduction 

Expressing signals as sparse linear combinations of a dictionary basis has enjoyed great success in applications ranging from image denoising to audio compression. Given a known dictionary matrix $D \in \mathbb{R}^{d \times K}$ with $K$ columns or atoms, one popular method to recover sparse coefficients $\alpha \in \mathbb{R}^K$ of the signal $x \in \mathbb{R}^d$ is through solving the convex $\ell_1$-minimization problem:

$$\text{minimize } \|\alpha\|_1 \quad \text{subject to } x = D\alpha.$$
This approach, known as basis pursuit (Chen et al., 1998), along with many of its variants, has been studied extensively in the statistics and signal processing communities. See, e.g., Donoho and Elad (2003); Fuchs (2004); Candes and Tao (2005).


For certain data types such as natural image patches, predefined dictionaries like the wavelets (Mallat, 2008) are usually available. However, when a less-known data type is encountered, a new dictionary has to be designed for effective representations. Dictionary learning, or sparse coding, adaptively learns a dictionary from a set of training signals such that they have sparse representations under this dictionary (Olshausen and Field, 1997). One formulation of dictionary learning involves solving a non-convex $\ell_1$-minimization problem (Plumbley, 2007; Gribonval and Schnass, 2010; Geng et al., 2011). Concretely, define

$$l(x, D) = \min_{\alpha \in \mathbb{R}^K} \{\|\alpha\|_1, \text{ subject to } x = D\alpha\}. \qquad (1)$$

We learn a dictionary from the $N$ signals $x_i \in \mathbb{R}^d$ for $i = 1, \ldots, N$ by solving:

$$\min_{D \in \mathcal{D}} L_N(D) = \min_{D \in \mathcal{D}} \frac{1}{N} \sum_{i=1}^{N} l(x_i, D). \qquad (2)$$


Here, $\mathcal{D} \subset \mathbb{R}^{d \times K}$ is a constraint set for candidate dictionaries. In many signal processing tasks, learning an adaptive dictionary via the optimization problem (2) and its variants is empirically successful ([...], 2006; Peyré, 2009; Grosse et al., 2012). For a review of dictionary learning algorithms and applications, see Elad (2010); Rubinstein et al. (2010); Mairal et al. (2014).


Despite the empirical success of many dictionary learning formulations, relatively little theory is available to explain why they work. One line of research treats the problem of dictionary identifiability: if the signals are generated using a dictionary $D_0$, referred to as the reference dictionary, under what conditions can we recover $D_0$ by solving the dictionary learning problem? Being able to identify the reference dictionary is important when we interpret the learned dictionary. Let $\alpha_i \in \mathbb{R}^K$ for $i = 1, \ldots, N$ be some random vectors. A popular signal generation model assumes that a signal vector can be expressed as a linear combination of the columns of the reference dictionary: $x_i = D_0 \alpha_i$ (Gribonval and Schnass, 2010; Geng et al., 2011; Gribonval et al., 2014). In this paper, we will study the problem of local identifiability of $\ell_1$-minimization dictionary learning (2) under this generating model.

Local identifiability. A reference dictionary $D_0$ is said to be locally identifiable with respect to an objective function $L(D)$ if $D_0$ is one of the local minima of $L$. The pioneering work of Gribonval and Schnass (2010) (referred to as GS henceforth) analyzed the $\ell_1$-minimization problem (2) for noiseless signals ($x_i = D_0 \alpha_i$) and complete ($d = K$ and full rank) dictionaries. Under a sparse Bernoulli-Gaussian model for the linear coefficients $\alpha_i$'s, they showed that for a sufficiently incoherent reference dictionary $D_0$, $N = O(K \log K)$ samples can guarantee local identifiability with respect to $L_N(D)$ in (2) with high probability. Still in the noiseless setting, Geng et al. (2011) extended the analysis to over-complete ($d < K$) dictionaries. More recently, under the noisy linear generative model ($x_i = D_0 \alpha_i + \text{noise}$) and the over-complete dictionary setting, Gribonval et al. (2014) developed the theory of local identifiability for (2) with $l(x, D)$ replaced by the LASSO objective function of Tibshirani (1996). Other related works on local identifiability include Schnass (2014) and Schnass (2015), who gave sufficient conditions for the local correctness of, respectively, the K-SVD algorithm (Aharon et al., 2006a) and a maximum response formulation of dictionary learning.


Contributions. There has not been much work on necessary conditions for local dictionary identifiability. Numerical experiments suggest that there is a phase boundary for local identifiability (Figure 1). The bound implied by the sufficient condition in GS falls well below the simulated phase boundary, suggesting that their result can be further improved. Thus, even though theoretical results for more general scenarios are available, we adopt the noiseless-signal and complete-dictionary setting of GS in order to find better local identifiability conditions. We summarize our major contributions below:

• For the population case where $N = \infty$, we establish a sufficient and almost necessary condition for local identifiability under both the $s$-sparse Gaussian model and the Bernoulli-Gaussian model. For the Bernoulli-Gaussian model, the phase boundary implied by our condition significantly improves the GS bound and agrees well with the simulated phase boundary (Figure 1).

• We provide lower and upper bounds to approximate the quantities involved in our sufficient and almost necessary condition, since computing those quantities exactly generally requires solving a series of second-order cone programs.


As a consequence, we show that a $\mu$-coherent reference dictionary, i.e., a dictionary with absolute pairwise column inner-product at most $\mu \in [0, 1)$, is locally identifiable for sparsity level, measured by the average number of nonzeros in the random linear coefficient vectors, up to the order $O(\mu^{-2})$. Moreover, if the sparsity level is greater than $O(\mu^{-2})$, the reference dictionary is generally not locally identifiable. In comparison, instead of imposing a condition on the sparsity level, the sufficient condition by GS demands that the number of dictionary atoms satisfy $K = O(\mu^{-2})$, which is a much more stringent requirement. For over-complete dictionaries, Geng et al. (2011) require the sparsity level to be of the order $O(\mu^{-1})$. It should also be noted that Schnass (2015) established the bound $O(\mu^{-2})$ for approximate local identifiability under a new response maximization formulation of dictionary learning. Our result is the first to show that $O(\mu^{-2})$ is achievable and optimal for exact local recovery under the $\ell_1$-minimization criterion.


• We also extend our identifiability results to the finite sample case. We show that for a fixed sparsity level, we need $N = O(K \log K)$ i.i.d. signals to determine whether or not the reference dictionary can be identified locally. This sample requirement is the same as that of GS and is the best known sample requirement among all previous studies on local identifiability.


Other related works. Apart from analyzing the local minima of dictionary learning, another line of research aims at designing provable algorithms for recovering the reference dictionary. Georgiev et al. (2005) and Aharon et al. (2006b) proposed combinatorial algorithms and gave deterministic conditions for dictionary recovery which require the sample size $N$ to be exponentially large in the number of dictionary atoms $K$. Spielman et al. (2012) established exact global recovery results for complete dictionaries through efficient convex programs. Agarwal et al. (2014c) and Arora et al. (2014) proposed clustering-based methods to estimate the reference dictionary in the overcomplete setting. Agarwal et al. (2014a) and Arora et al. (2015) provided theoretical guarantees for their alternating minimization algorithms. Sun et al. (2015) proposed a non-convex optimization algorithm that provably recovers a complete reference dictionary for sparsity level up to $O(K)$. While in this paper we do not provide an algorithm, our identifiability conditions suggest theoretical limits of dictionary recovery for all algorithms attempting to solve the optimization problem (2). In particular, in the regime where the reference dictionary is not identifiable, no algorithm can simultaneously solve (2) and return the ground truth reference dictionary.

Figure 1: Local recovery error for the $s$-sparse Gaussian model (Left) and the Bernoulli($p$)-Gaussian model (Right). The parameter $s \in \{1, \ldots, K\}$ is the number of nonzeros in each linear coefficient vector under the $s$-sparse Gaussian model, and $p \in (0, 1]$ is the probability of an entry of the linear coefficient vector being nonzero under the Bernoulli($p$)-Gaussian model. The data are generated with the reference dictionary $D_0 \in \mathbb{R}^{10 \times 10}$ (i.e., $K = 10$) satisfying $D_0^T D_0 = \mu \mathbf{1}\mathbf{1}^T + (1 - \mu) I$ for $\mu \in [0, 1)$; see Example 3.5 for details. For each $(\mu, s)$ or $(\mu, p)$ tuple, ten batches of $N = 2000$ signals $\{x_i\}_{i=1}^{2000}$ are generated according to the noiseless linear model $x_i = D_0 \alpha_i$, with $\{\alpha_i\}_{i=1}^{2000}$ drawn i.i.d. from the $s$-sparse Gaussian model or i.i.d. from the Bernoulli($p$)-Gaussian model. For each batch, the dictionary is estimated through an alternating minimization algorithm in the SPAMS package (Mairal et al., 2010), with the initial dictionary set to $D_0$. The grayscale intensity in the figure corresponds to the Frobenius norm of the difference between the estimated dictionary and the reference dictionary $D_0$, averaged over the ten batches. The "phase boundary" curve corresponds to the theoretical boundary that separates the region of local identifiability (below the curve) from the region of local non-identifiability (above the curve) according to Theorem 1 of this paper. The "Sufficient condition (Corollary 1)" and "Necessary condition (Corollary 1)" curves are the lower and upper bounds given by Corollary 1 to approximate the exact phase boundary. Finally, the "Sufficient condition (GS)" curve corresponds to the lower bound by GS. Note that for the $s$-sparse Gaussian model, the "Sufficient condition (Corollary 1)" and "Necessary condition (Corollary 1)" curves coincide with the phase boundary.

Other related works include generalization bounds for signal reconstruction errors under the learned dictionary (Maurer and Pontil, 2010; Vainsencher et al., 2011; Mehta and Gray, 2012; Gribonval et al., 2013), dictionary identifiability through combinatorial matrix theory (Hillar and Sommer, 2015), as well as algorithms and theories for the closely related independent component analysis (Comon, 1994; Arora et al., 2012b) and nonnegative matrix factorization (Arora et al., 2012a; Recht et al., 2012).


The rest of the paper is organized as follows. In Section 2, we give basic assumptions and describe the two probabilistic models for signal generation. Section 3 develops sufficient and almost necessary local identifiability conditions under both models for the population problem, and establishes lower and upper bounds to approximate the quantities involved in the conditions. In Section 4, we present local identifiability results for the finite sample problem. Detailed proofs of the theoretical results can be found in the Appendix.


2 Preliminaries 

2.1 Notations 

For a positive integer $m$, define $[m]$ to be the set of the first $m$ positive integers, $\{1, \ldots, m\}$. The notation $x[i]$ denotes the $i$-th entry of the vector $x \in \mathbb{R}^m$. For a non-empty index set $S \subset [m]$, we denote by $|S|$ the set cardinality and by $x[S] \in \mathbb{R}^{|S|}$ the sub-vector indexed by $S$. We define $x[-j] := (x[1], \ldots, x[j-1], x[j+1], \ldots, x[m]) \in \mathbb{R}^{m-1}$ to be the vector $x$ without its $j$-th entry.

For a matrix $A \in \mathbb{R}^{m \times n}$, we denote by $A[i, j]$ its $(i, j)$-th entry. For non-empty sets $S \subset [m]$ and $T \subset [n]$, denote by $A[S, T]$ the submatrix of $A$ with the rows indexed by $S$ and columns indexed by $T$. Denote by $A[i, ]$ and $A[, j]$ the $i$-th row and the $j$-th column of $A$ respectively. Similar to the vector case, the notation $A[-i, j] \in \mathbb{R}^{m-1}$ denotes the $j$-th column of $A$ without its $i$-th entry.

For $p \geq 1$, the $\ell_p$-norm of a vector $x \in \mathbb{R}^m$ is defined as $\|x\|_p = \left(\sum_{i=1}^{m} |x[i]|^p\right)^{1/p}$, with the convention that $\|x\|_0 = |\{i : x[i] \neq 0\}|$ and $\|x\|_\infty = \max_i |x[i]|$. For any norm $\|\cdot\|$ on $\mathbb{R}^m$, the dual norm of $\|\cdot\|$ is defined as $\|x\|^* = \sup_{y \neq 0} \frac{x^T y}{\|y\|}$.

For two sequences of real numbers $\{a_n\}_{n=1}^{\infty}$ and $\{b_n\}_{n=1}^{\infty}$, we write $a_n = O(b_n)$ if there is a constant $C > 0$ such that $a_n \leq C b_n$ for all $n \geq 1$. For $a \in \mathbb{R}$, denote by $\lfloor a \rfloor$ the integer part of $a$ and by $\lceil a \rceil$ the smallest integer greater than or equal to $a$. Throughout this paper, we shall agree that $\frac{0}{0} = 0$.


2.2 Basic assumptions 

We denote by $\mathcal{D} \subset \mathbb{R}^{d \times K}$ the constraint set of dictionaries for the optimization problem (2). In this paper, since we focus on complete dictionaries, we assume $d = K$. As in GS, we choose $\mathcal{D}$ to be the oblique manifold (Absil et al., 2008):

$$\mathcal{D} = \left\{D \in \mathbb{R}^{K \times K} : \|D[, k]\|_2 = 1 \text{ for all } k = 1, \ldots, K\right\}.$$

We also call a column $D[, k]$ of the dictionary an atom of the dictionary. Denote by $D_0 \in \mathcal{D}$ the reference dictionary, i.e., the ground truth dictionary that generates the signals. With these notations, we now give a formal definition of local identifiability:


Definition 2.1. (Local identifiability) Let $L(D) : \mathcal{D} \to \mathbb{R}$ be an objective function. We say that the reference dictionary $D_0$ is locally identifiable with respect to $L(D)$ if $D_0$ is a local minimum of $L(D)$.


Sign-permutation ambiguity. As noted in GS and Geng et al. (2011), there is an intrinsic sign-permutation ambiguity with the $\ell_1$-norm objective function $L(D) = L_N(D)$ of (2). Let $D' = D P \Lambda$ for some permutation matrix $P$ and diagonal matrix $\Lambda$ with $\pm 1$ diagonal entries. It is easy to see that $D'$ and $D$ have the same objective value. Thus, the objective function $L_N(D)$ has at least $2^K K!$ local minima, and we can only recover $D_0$ up to column permutation and column sign changes.
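The ambiguity can be checked numerically. The following Python sketch is an illustration only and not part of the original analysis: it assumes an invertible dictionary, so that the constraint $x = D\alpha$ has the unique solution $\alpha = D^{-1}x$ and the objective reduces to $\|D^{-1}x\|_1$. The helper names (`solve`, `l1_objective`, `empirical_objective`, `permute_and_flip`) and the toy numbers are our own choices.

```python
from itertools import permutations, product

def solve(D, x):
    """Solve D a = x by Gauss-Jordan elimination with partial pivoting (D square, invertible)."""
    n = len(D)
    aug = [list(map(float, D[i])) + [float(x[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * q for v, q in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def l1_objective(x, D):
    """l(x, D) = ||D^{-1} x||_1, valid under the assumption that D is invertible."""
    return sum(abs(a) for a in solve(D, x))

def empirical_objective(signals, D):
    """Average of l(x_i, D) over the signals."""
    return sum(l1_objective(x, D) for x in signals) / len(signals)

def permute_and_flip(D, perm, signs):
    """Return D P Lambda: column j of the result is signs[j] times column perm[j] of D."""
    return [[signs[j] * row[perm[j]] for j in range(len(perm))] for row in D]

# Toy dictionary with unit-norm columns and a few arbitrary signals.
D = [[1.0, 0.6, 0.0],
     [0.0, 0.8, 0.0],
     [0.0, 0.0, 1.0]]
signals = [[1.0, 2.0, -1.0], [0.5, 0.0, 3.0], [-2.0, 1.0, 0.0]]

base = empirical_objective(signals, D)
values = [empirical_objective(signals, permute_and_flip(D, perm, signs))
          for perm in permutations(range(3))
          for signs in product([1.0, -1.0], repeat=3)]
# Number of equivalent dictionaries visited and the largest deviation from the base value.
print(len(values), max(abs(v - base) for v in values))
```

For three atoms the loop visits all $2^3 \cdot 3! = 48$ sign-permuted copies of the toy dictionary, and the objective values agree up to floating-point rounding.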

Note that if the dictionary atoms are linearly dependent, the effective dimension is strictly less than $K$ and the problem essentially becomes over-complete. Since dealing with over-complete dictionaries is beyond the scope of this paper, we make the following assumption:

Assumption I (Complete dictionaries). The reference dictionary $D_0 \in \mathcal{D} \subset \mathbb{R}^{K \times K}$ is full rank.

Let $M_0 = D_0^T D_0$ be the dictionary atom collinearity matrix containing the inner-products between dictionary atoms. Since each dictionary atom has unit $\ell_2$-norm, $M_0[i, i] = 1$ for all $i \in [K]$. In addition, as $D_0$ is full rank, $M_0$ is positive definite and $|M_0[i, j]| < 1$ for all $i \neq j$.

We assume that a signal is generated as a random linear combination of the dictionary atoms. In this paper, we consider the following two probabilistic models for the random linear coefficients:

Probabilistic models for sparse coefficients. Denote by $z \in \mathbb{R}^K$ a random vector from the $K$-dimensional standard normal distribution.

Model 1: SG($s$). Let $S$ be a size-$s$ subset uniformly drawn from all size-$s$ subsets of $[K]$. Define $\xi \in \{0, 1\}^K$ by setting $\xi[j] = I\{j \in S\}$ for $j \in [K]$, where $I\{\cdot\}$ is the indicator function. Let $\alpha \in \mathbb{R}^K$ be such that $\alpha[j] = \xi[j] z[j]$. Then we say $\alpha$ is drawn from the $s$-sparse Gaussian model, or SG($s$).

Model 2: BG($p$). For $j \in [K]$, let the $\xi[j]$'s be i.i.d. Bernoulli random variables with success probability $p \in (0, 1]$. Let $\alpha \in \mathbb{R}^K$ be such that $\alpha[j] = \xi[j] z[j]$. Then we say $\alpha$ is drawn from the Bernoulli($p$)-Gaussian model, or BG($p$).
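The two coefficient models are straightforward to simulate. The sketch below is a minimal illustration using only the Python standard library; the function names (`sample_sg`, `sample_bg`, `generate_signal`) are hypothetical, and dictionaries are represented as plain lists of rows. Drawing Gaussian values only on the support is equivalent in distribution to masking a full standard normal vector.

```python
import random

def sample_sg(K, s, rng):
    """Draw alpha from SG(s): standard normal entries on a uniformly random size-s support."""
    support = set(rng.sample(range(K), s))
    return [rng.gauss(0.0, 1.0) if j in support else 0.0 for j in range(K)]

def sample_bg(K, p, rng):
    """Draw alpha from BG(p): each entry is standard normal with probability p, else 0."""
    return [rng.gauss(0.0, 1.0) if rng.random() < p else 0.0 for _ in range(K)]

def generate_signal(D0, alpha):
    """Noiseless linear model x = D0 alpha, with D0 given as a list of rows."""
    return [sum(row[j] * alpha[j] for j in range(len(alpha))) for row in D0]

rng = random.Random(0)
K, s, p = 10, 3, 0.3
alpha_sg = sample_sg(K, s, rng)
counts = [sum(a != 0.0 for a in sample_bg(K, p, rng)) for _ in range(5000)]
print(sum(a != 0.0 for a in alpha_sg))   # number of nonzeros of one SG(s) draw
print(sum(counts) / len(counts))         # average number of nonzeros under BG(p)
```

Under SG($s$) the support size is fixed at $s$, whereas under BG($p$) it is binomial with mean $pK$, which is what allows the occasional dense coefficient vector.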


With the above two models we can formally state the following assumption for random signal generation:

Assumption II (Signal generation). For $i \in [N]$, let the $\alpha_i$'s be either i.i.d. $s$-sparse Gaussian vectors or i.i.d. Bernoulli($p$)-Gaussian vectors. The signals $x_i$'s are generated according to the noiseless linear model:

$$x_i = D_0 \alpha_i.$$


Remarks:

(1) The above two models and their variants were studied in a number of prior theoretical works, including Gribonval and Schnass (2010); Geng et al. (2011); Gribonval et al. (2014); Agarwal et al. (2014b); Sun et al. (2015).

(2) By construction, a random vector generated from the $s$-sparse model has exactly $s$ nonzero entries. The data points $x_i$'s therefore lie within the union of the linear spans of $s$ dictionary atoms (Figure 2, Left). The Bernoulli($p$)-Gaussian model, on the other hand, allows the random coefficient vector to have any number of nonzero entries ranging from 0 to $K$, with mean $pK$. As a result, the data points can fall outside of any sparse linear span of the dictionary atoms (Figure 2, Right). We refer readers to the remarks following Example 3.5 in Section 3 for a discussion of the effect of non-sparse outliers on local identifiability.

(3) Our local identifiability results can be extended to a wider class of sub-Gaussian distributions. However, such an extension would increase the complexity of the quantities involved in our theorems. As a proof of concept, we focus for now on the standard Gaussian distribution only.




Figure 2: Data generation for $K = 2$. Left: the $s$-sparse Gaussian model with $s = 1$; Right: the Bernoulli($p$)-Gaussian model with $p = 0.2$. The dictionary is constructed such that the inner product between the two dictionary atoms is 0.7. A sample of $N = 1000$ data points is generated for both models. For the $s$-sparse model, all data points are perfectly aligned with the two lines corresponding to the two dictionary atoms. For the Bernoulli($p$)-Gaussian model, a number of data points fall outside the two lines. According to our Theorems 1 and 3, despite those outliers and the high collinearity between the two atoms, the reference dictionary is still locally identifiable at the population level and with high probability for finite samples.


In this paper, we study the problem of dictionary identifiability with respect to the population objective function $E\,L_N(D)$ (Section 3) and the finite sample objective function $L_N(D)$ (Section 4). In order to analyze these objective functions, it is convenient to define the following "group LASSO"-type norms:


Definition 2.2. For an integer $m \geq 2$ and $w \in \mathbb{R}^m$:

1. For $k \in [m]$, define

$$|||w|||_k = \frac{\sum_{|S| = k} \|w[S]\|_2}{\binom{m-1}{k-1}}.$$

2. For $p \in (0, 1)$, define

$$|||w|||_p = \sum_{k=0}^{m-1} \mathrm{pbinom}(k; m-1, p)\, |||w|||_{k+1},$$

where pbinom is the probability mass function of the binomial distribution:

$$\mathrm{pbinom}(k; n, p) = \binom{n}{k} p^k (1-p)^{n-k}.$$

Remark:

(1) Note that the above norms $|||w|||_s$ and $|||w|||_p$ are, up to constant factors, the expected values of $|w^T \alpha|$ with the random vector $\alpha$ drawn from the SG($s$) model and the BG($p$) model respectively. For invertible $D \in \mathcal{D}$, it can be shown that the objective function for one signal $x = D_0 \alpha$ is

$$l(x, D) = \|H\alpha\|_1 = \sum_{j=1}^{K} |H[j, ]\,\alpha|,$$

where $H = D^{-1} D_0$. Thus, taking the expectation of the objective function with respect to $x$, we end up with a quantity involving either $\sum_{j=1}^{K} |||H[j, ]|||_s$ or $\sum_{j=1}^{K} |||H[j, ]|||_p$. This is the motivation for defining these norms.

(2) In particular, $|||w|||_1 = \|w\|_1$ and $|||w|||_m = \|w\|_2$.

(3) The norms defined above are special cases of the group LASSO penalty by Yuan and Lin (2006). For $|||w|||_k$, the summation covers all size-$k$ subsets of $[m]$. The normalization factor is the number of times $w[i]$ appears in the numerator. Thus, $|||w|||_k$ is essentially the average of the $\ell_2$-norms of all size-$k$ sub-vectors of $w$. On the other hand, $|||w|||_p$ is a weighted average of the $|||w|||_k$'s with binomial probabilities.
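For small $m$ the two norms of Definition 2.2 can be evaluated by brute-force enumeration of subsets. The sketch below is illustrative only; the function names `norm_k`, `pbinom` and `norm_p` are our own, and the enumeration cost grows combinatorially in $m$.

```python
from itertools import combinations
from math import comb, sqrt

def norm_k(w, k):
    """|||w|||_k: sum of ||w[S]||_2 over all size-k subsets S, divided by C(m-1, k-1)."""
    m = len(w)
    total = sum(sqrt(sum(w[i] ** 2 for i in S)) for S in combinations(range(m), k))
    return total / comb(m - 1, k - 1)

def pbinom(k, n, p):
    """Binomial probability mass function."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def norm_p(w, p):
    """|||w|||_p: average of |||w|||_{k+1}, k = 0, ..., m-1, with Binomial(m-1, p) weights."""
    m = len(w)
    return sum(pbinom(k, m - 1, p) * norm_k(w, k + 1) for k in range(m))

w = [3.0, -4.0, 0.0, 12.0]
print(norm_k(w, 1), norm_k(w, len(w)))   # the l1 norm and the l2 norm of w
print(norm_p(w, 0.25))
```

The first line of output reproduces the two special cases $|||w|||_1 = \|w\|_1$ and $|||w|||_m = \|w\|_2$; since the binomial weights sum to one, the value of `norm_p` always lies between the smallest and the largest of the $|||w|||_k$.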


3 Population Analysis 

In this section, we establish local identifiability results for the case where infinitely many signals are observed. Denote by $E\,l(x_1, D)$ the expectation of the objective function $l(x_1, D)$ of (1) with respect to the random signal $x_1$. By the strong law of large numbers, as the number of signals $N$ tends to infinity, the empirical objective function $L_N(D) = \frac{1}{N} \sum_{i=1}^{N} l(x_i, D)$ converges almost surely to its population mean $E\,l(x_1, D)$ for each fixed $D \in \mathcal{D}$. Therefore the population version of the optimization problem (2) is:

$$\min_{D \in \mathcal{D}} E\,l(x_1, D). \qquad (3)$$

Note that we only need to work with $D \in \mathcal{D}$ that is full rank. Indeed, if the linear span of the columns of $D$ satisfies $\mathrm{span}(D) \neq \mathbb{R}^K$, then $D_0 \alpha_1 \notin \mathrm{span}(D)$ with nonzero probability. Thus $D$ is infeasible with nonzero probability and so $E\,l(x_1, D) = +\infty$. For a full rank dictionary $D$, the following lemma gives the closed-form expressions for the expected objective function $E\,l(x_1, D)$:

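Lemma 3.1. (Closed-form objective functions) Let $D$ be a full rank dictionary in $\mathcal{D}$ and $x_1 = D_0 \alpha_1$, where $\alpha_1 \in \mathbb{R}^K$ is a random vector. For notational convenience, let $H = D^{-1} D_0$.

1. If $\alpha_1$ is generated according to the SG($s$) model with $s \in [K-1]$,

$$L_{SG(s)}(D) := E\,l(x_1, D) = \sqrt{\frac{2}{\pi}}\, \frac{s}{K} \sum_{j=1}^{K} |||H[j, ]|||_s. \qquad (4)$$

2. If $\alpha_1$ is generated according to the BG($p$) model with $p \in (0, 1)$,

$$L_{BG(p)}(D) := E\,l(x_1, D) = \sqrt{\frac{2}{\pi}}\, p \sum_{j=1}^{K} |||H[j, ]|||_p. \qquad (5)$$

For the non-sparse cases $s = K$ and $p = 1$, we have

$$L_{SG(s)}(D) = L_{BG(p)}(D) = \sqrt{\frac{2}{\pi}} \sum_{j=1}^{K} \|H[j, ]\|_2.$$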

Remark: It can be seen from the above closed-form expressions that the two models are closely related. First of all, it is natural to identify $p$ with $\frac{s}{K}$, the expected fraction of nonzero entries in $\alpha_1$. Next, by definition, $|||\cdot|||_p$ is a binomial average of $|||\cdot|||_k$. Therefore, the Bernoulli-Gaussian objective function $L_{BG(p)}(D)$ can be treated as a binomial average of the $s$-sparse objective function $L_{SG(s)}(D)$.
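Each row term in the closed-form objectives is an expectation of the form $E|w^T\alpha|$, which equals $\sqrt{2/\pi}\,(s/m)\,|||w|||_s$ under SG($s$) and $\sqrt{2/\pi}\,p\,|||w|||_p$ under BG($p$) for $w \in \mathbb{R}^m$. As a sanity check, the following sketch compares these expressions with a seeded Monte Carlo estimate. The function names and the test vector are our own, and the agreement is only up to simulation error.

```python
import random
from itertools import combinations
from math import comb, pi, sqrt

def norm_k(w, k):
    """|||w|||_k from Definition 2.2, by enumeration of size-k subsets."""
    m = len(w)
    total = sum(sqrt(sum(w[i] ** 2 for i in S)) for S in combinations(range(m), k))
    return total / comb(m - 1, k - 1)

def norm_p(w, p):
    """|||w|||_p from Definition 2.2: Binomial(m-1, p) average of |||w|||_{k+1}."""
    m = len(w)
    return sum(comb(m - 1, k) * p ** k * (1 - p) ** (m - 1 - k) * norm_k(w, k + 1)
               for k in range(m))

def closed_form(w, model, param):
    """sqrt(2/pi) * (s/m) * |||w|||_s for 'SG', sqrt(2/pi) * p * |||w|||_p for 'BG'."""
    if model == "SG":
        return sqrt(2 / pi) * param / len(w) * norm_k(w, param)
    return sqrt(2 / pi) * param * norm_p(w, param)

def monte_carlo(w, model, param, n_draws, rng):
    """Monte Carlo estimate of E|w^T alpha| under the chosen coefficient model."""
    m, total = len(w), 0.0
    for _ in range(n_draws):
        if model == "SG":
            support = rng.sample(range(m), param)
        else:
            support = [j for j in range(m) if rng.random() < param]
        total += abs(sum(w[j] * rng.gauss(0.0, 1.0) for j in support))
    return total / n_draws

rng = random.Random(1)
w = [1.0, -2.0, 0.5, 3.0]
for model, param in (("SG", 2), ("BG", 0.5)):
    print(model, round(closed_form(w, model, param), 3),
          round(monte_carlo(w, model, param, 40000, rng), 3))
```

Summing the closed-form value over the rows $w = H[j, ]$ of $H = D^{-1}D_0$ gives the population objectives (4) and (5).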

By analyzing the above closed-form expressions of the $\ell_1$-norm objective function, we establish the following sufficient and almost necessary conditions for population local identifiability:

Theorem 1. (Population local identifiability) Recall that $M_0 = D_0^T D_0$ and that $M_0[-j, j]$ denotes the $j$-th column of $M_0$ without its $j$-th entry. Let $|||\cdot|||_s^*$ and $|||\cdot|||_p^*$ be the dual norms of $|||\cdot|||_s$ and $|||\cdot|||_p$ respectively.

1. (SG($s$) models) For $K \geq 2$ and $s \in [K-1]$, if

$$\max_{j \in [K]} |||M_0[-j, j]|||_s^* < 1 - \frac{s-1}{K-1},$$

then $D_0$ is locally identifiable with respect to $L_{SG(s)}$.

2. (BG($p$) models) For $K \geq 2$ and $p \in (0, 1)$, if

$$\max_{j \in [K]} |||M_0[-j, j]|||_p^* < 1 - p,$$

then $D_0$ is locally identifiable with respect to $L_{BG(p)}$.

Moreover, the above conditions are almost necessary in the sense that if the reverse strict inequalities hold, then $D_0$ is not locally identifiable.

On the other hand, if $s = K$ or $p = 1$, then $D_0$ is not locally identifiable with respect to $L_{SG(s)}$ or $L_{BG(p)}$.

Proof sketch. Let $\{D_t\}_{t \in \mathbb{R}}$ be a collection of dictionaries $D_t \in \mathcal{D}$ indexed by $t \in \mathbb{R}$, and let $L(D) = E\,l(x_1, D)$ be the population objective function. The reference dictionary $D_0$ is a local minimum of $L(D)$ on the manifold $\mathcal{D}$ if and only if the following statement holds: for any $\{D_t\}_{t \in \mathbb{R}}$ that is a smooth function of $t$ with non-vanishing derivative at $t = 0$, $L(D_t)$ has a local minimum at $t = 0$. For a fixed $\{D_t\}_{t \in \mathbb{R}}$, to ensure that $L(D_t)$ achieves a local minimum at $t = 0$, it suffices to have the following one-sided derivative inequalities:

$$\lim_{t \downarrow 0^+} \frac{L(D_t) - L(D_0)}{t} > 0 \quad \text{and} \quad \lim_{t \uparrow 0^-} \frac{L(D_t) - L(D_0)}{t} < 0.$$

With some algebra, the two inequalities can be translated into the following statement:

$$\max_{j \in [K]} \left|M_0[-j, j]^T w\right| < \begin{cases} 1 - \frac{s-1}{K-1} & \text{for SG}(s) \text{ models}, \\ 1 - p & \text{for BG}(p) \text{ models}, \end{cases}$$

where $w \in \mathbb{R}^{K-1}$ is a unit vector in terms of the norm $|||\cdot|||_s$ or $|||\cdot|||_p$, and it corresponds to the "approaching direction" of $D_t$ to $D_0$ on $\mathcal{D}$ as $t$ tends to zero. Since $t = 0$ has to be a local minimum for all smooth $\{D_t\}_{t \in \mathbb{R}}$, or approaching directions, taking the supremum over all such unit vectors turns the LHS of the above inequality into the dual norm of $|||\cdot|||_s$ or $|||\cdot|||_p$. On the other hand, $D_0$ is not a local minimum if $\lim_{t \downarrow 0^+} (L(D_t) - L(D_0))/t < 0$ or $\lim_{t \uparrow 0^-} (L(D_t) - L(D_0))/t > 0$ for some $\{D_t\}_{t \in \mathbb{R}}$. Thus our condition is also almost necessary. We refer readers to Section A.1.2 for the detailed proof.
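The one-sided derivative argument can be illustrated numerically in the smallest case $K = 2$ under the BG($p$) model. The sketch below is not a proof and relies on our own assumptions: it uses the closed-form population objective $\sqrt{2/\pi}\,p\sum_j |||H[j, ]|||_p$ with $H = D^{-1}D_0$, the fact that on $\mathbb{R}^2$ the norm $|||w|||_p$ reduces to $(1-p)\|w\|_1 + p\|w\|_2$, and a parametrization of unit-norm atoms by angles. It only compares $D_0$ with eight nearby dictionaries, and the function names are hypothetical.

```python
import math

def atom(angle):
    """Unit-norm atom in R^2 at the given angle."""
    return (math.cos(angle), math.sin(angle))

def bg_population_objective(angles, ref_angles, p):
    """Closed-form population objective under BG(p) for K = 2, with H = D^{-1} D0."""
    (a, c), (b, d) = atom(angles[0]), atom(angles[1])          # D  = [[a, b], [c, d]]
    (e, g), (f, h) = atom(ref_angles[0]), atom(ref_angles[1])  # D0 = [[e, f], [g, h]]
    det = a * d - b * c
    H = [[(d * e - b * g) / det, (d * f - b * h) / det],
         [(-c * e + a * g) / det, (-c * f + a * h) / det]]
    # On R^2 the norm |||w|||_p equals (1 - p) * ||w||_1 + p * ||w||_2.
    rows = sum((1 - p) * (abs(r[0]) + abs(r[1])) + p * math.hypot(r[0], r[1]) for r in H)
    return math.sqrt(2 / math.pi) * p * rows

def is_grid_local_min(mu, p, eps=1e-3):
    """Check D0 against the 8 dictionaries obtained by rotating each atom by 0 or +-eps."""
    ref = (0.0, math.acos(mu))          # two atoms with inner product mu
    base = bg_population_objective(ref, ref, p)
    neighbours = [(ref[0] + t1, ref[1] + t2)
                  for t1 in (-eps, 0.0, eps) for t2 in (-eps, 0.0, eps)
                  if (t1, t2) != (0.0, 0.0)]
    return all(bg_population_objective(nb, ref, p) > base for nb in neighbours)

print(is_grid_local_min(mu=0.5, p=0.3))   # regime |mu| < 1 - p
print(is_grid_local_min(mu=0.8, p=0.3))   # regime |mu| > 1 - p
```

In the first regime every neighbouring dictionary has a larger objective value, while in the second regime rotating one atom slightly already decreases the objective, in line with the condition $|\mu| < 1 - p$ for two atoms.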

Local identifiability phase boundary. The conditions in Theorem 1 indicate that population local identifiability undergoes a phase transition. The following equations

$$\max_{j \in [K]} |||M_0[-j, j]|||_s^* = 1 - \frac{s-1}{K-1} \quad \text{and} \quad \max_{j \in [K]} |||M_0[-j, j]|||_p^* = 1 - p$$

define the local identifiability phase boundaries which separate the region of local identifiability, in terms of the dictionary atom collinearity matrix $M_0$ and the sparsity level $s$ or $p$, from the region of local non-identifiability, under the respective models.

The roles of dictionary atom collinearity and sparsity. Both the dictionary atom collinearity matrix $M_0$ and the sparsity parameter $s$ or $p$ play roles in determining local identifiability. Loosely speaking, for $D_0$ to be locally identifiable, neither can the atoms of $D_0$ be too linearly dependent, nor can the random coefficient vectors that generate the data be too dense. For the $s$-sparse Gaussian model, the quantity $\max_{j \in [K]} |||M_0[-j, j]|||_s^*$ measures the size of the off-diagonal entries of $M_0$ and hence the collinearity of the dictionary atoms. In addition, that quantity also depends on the sparsity parameter $s$. By Lemma A.3 in the Appendix, $\max_{j \in [K]} |||M_0[-j, j]|||_s^*$ is strictly increasing with respect to $s$ for $M_0$ whose upper-triangle portion contains at least two nonzero entries (if the upper-triangle portion contains at most one nonzero entry, then the quantity does not depend on $s$; see Example 3.4). A similar conclusion holds for the Bernoulli-Gaussian model. Therefore, the sparser the linear coefficients, the less restrictive the requirement on dictionary atom collinearity.

On the other hand, for a fixed $M_0$, by the monotonicity of $\max_{j \in [K]} |||M_0[-j, j]|||_s^*$ with respect to $s$, the collection of $s$ that leads to local identifiability is of the form $s < s^*(M_0)$ for some function $s^*$ of $M_0$. Similarly, for the Bernoulli-Gaussian model, $p < p^*(M_0)$ for some function $p^*$ of $M_0$.

Next, we will study some examples to gain more intuition for the local identifiability conditions. 

Example 3.1. (1-sparse Gaussian model) A full rank $D_0$ is always locally identifiable at the population level under a 1-sparse Gaussian model. Indeed, by Corollary A.3 in the Appendix, $|||M_0[-j, j]|||_1^* = \max_{i \neq j} |M_0[i, j]| < 1$ for all $j \in [K]$. Thus, a full rank dictionary $D_0$ always satisfies the sufficient condition.
Example 3.2. ($(K-1)$-sparse Gaussian model) For $j \in [K]$, $M_0[-j, j] \in \mathbb{R}^{K-1}$. Thus by Lemma A.3,

$$|||M_0[-j, j]|||_{K-1}^* = \|M_0[-j, j]\|_2.$$

Therefore the phase boundary under the $(K-1)$-sparse model is

$$\max_{j \in [K]} \|M_0[-j, j]\|_2 = \frac{1}{K-1}.$$

Example 3.3. (Orthogonal dictionaries) If $M_0 = I$, then

$$\max_{j \in [K]} |||M_0[-j, j]|||_s^* = \max_{j \in [K]} |||M_0[-j, j]|||_p^* = 0.$$

Therefore orthogonal dictionaries are always locally identifiable if $s < K$ or $p < 1$.

Example 3.4. (Minimally dependent dictionary atoms) Let $\mu \in (-1, 1)$. Consider a dictionary atom collinearity matrix $M_0$ such that $M_0[1, 2] = M_0[2, 1] = \mu$ and $M_0[i, j] = 0$ for all other $i \neq j$. By Corollary A.4 in the Appendix,

$$\max_{j \in [K]} |||M_0[-j, j]|||_s^* = \max_{j \in [K]} |||M_0[-j, j]|||_p^* = |\mu|.$$

Thus the phase boundaries under the respective models are:

$$|\mu| = 1 - \frac{s-1}{K-1} \quad \text{and} \quad |\mu| = 1 - p.$$

Notice that when $K = 2$ and for the Bernoulli-Gaussian model, the phase boundary agrees well with the empirical phase boundary in the simulation result by GS (Figure 3 of the GS paper).

Example 3.5. (Constant inner-product dictionaries) Let $M_0 = \mu \mathbf{1}\mathbf{1}^T + (1 - \mu) I$, i.e., $D_0[, i]^T D_0[, j] = \mu$ for $1 \leq i < j \leq K$. Note that $M_0$ is positive definite if and only if $\mu \in \left(-\frac{1}{K-1}, 1\right)$. By Corollary A.5 in the Appendix, we have

$$|||M_0[-j, j]|||_s^* = \sqrt{s}\,|\mu|.$$

Thus for the $s$-sparse model, the phase boundary is

$$\sqrt{s}\,|\mu| = 1 - \frac{s-1}{K-1}.$$

Similarly for the Bernoulli($p$)-Gaussian model, we have

$$|||M_0[-j, j]|||_p^* = |\mu|\, p (K-1) \left(\sum_{k=0}^{K-1} \mathrm{pbinom}(k, K-1, p) \sqrt{k}\right)^{-1}.$$

Thus the phase boundary is

$$|\mu| = \frac{1-p}{p(K-1)} \sum_{k=0}^{K-1} \mathrm{pbinom}(k, K-1, p) \sqrt{k}.$$

Figure 3 shows the phase boundaries for different dictionary sizes under the two models. As $K$ increases, the phase boundary moves towards the lower left of the region. This observation indicates that recovering the reference dictionary locally becomes increasingly difficult for larger dictionary sizes.
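The two phase boundaries of Example 3.5 are cheap to evaluate. The following sketch tabulates the boundary value of $|\mu|$ for a few dictionary sizes at matched sparsity fractions $s/K$ and $p$; the function names and the chosen grid are our own, and the output is only meant to illustrate the downward shift of the boundary as $K$ grows.

```python
from math import comb, sqrt

def pbinom(k, n, p):
    """Binomial probability mass function."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def boundary_sg(s, K):
    """|mu| on the SG(s) phase boundary, from sqrt(s) * |mu| = 1 - (s - 1) / (K - 1)."""
    return (1 - (s - 1) / (K - 1)) / sqrt(s)

def boundary_bg(p, K):
    """|mu| on the BG(p) phase boundary: (1 - p) / (p (K - 1)) * E sqrt(Binomial(K - 1, p))."""
    mean_sqrt = sum(pbinom(k, K - 1, p) * sqrt(k) for k in range(K))
    return (1 - p) / (p * (K - 1)) * mean_sqrt

for K in (10, 20, 50):
    sg = [round(boundary_sg(K * num // 10, K), 3) for num in (1, 2, 5)]   # s/K = 0.1, 0.2, 0.5
    bg = [round(boundary_bg(p, K), 3) for p in (0.1, 0.2, 0.5)]
    print(K, sg, bg)
```

Dictionaries with $|\mu|$ below the tabulated value are locally identifiable at the population level, and those above it are not.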
Figure 3: Local identifiability phase boundaries for constant inner-product dictionaries, under Left: the $s$-sparse Gaussian model; Right: the Bernoulli($p$)-Gaussian model. For each model, phase boundaries for different dictionary sizes $K$ are shown. Note that $s/K \in \{1/K, 2/K, \ldots, 1\}$ and $p \in (0, 1]$. The area under the curves is the region where the reference dictionaries are locally identifiable at the population level. Due to symmetry, we only plot the portion of the phase boundaries for $\mu \geq 0$.


The effect of non-sparse outliers. Example 3.5 demonstrates how the presence of non-sparse outliers in the Bernoulli-Gaussian model (Figure 2, Right) affects the requirements for local identifiability. Set $p = \frac{s}{K}$ in order to have the same level of sparsity as the SG($s$) model. Applying Jensen's inequality, one can show that

$$\frac{1-p}{p(K-1)} \sum_{k=0}^{K-1} \mathrm{pbinom}(k, K-1, p) \sqrt{k} < \frac{1}{\sqrt{s}} \left(1 - \frac{s-1}{K-1}\right),$$

indicating that the phase boundary of the $s$-sparse model is always above that of the Bernoulli-Gaussian model with the same level of sparsity. The difference between the two phase boundaries is the extra price one has to pay, in terms of the collinearity parameter $\mu$, for recovering the dictionary locally in the presence of non-sparse outliers. One extreme example is the case where $s = 1$ and correspondingly $p = \frac{1}{K}$. By Example 3.1, under a 1-sparse model the reference dictionary $D_0$ is always locally identifiable if $|\mu| < 1$. But for the BG($\frac{1}{K}$) model, by the remark in Corollary 1, $D_0$ is not locally identifiable if $|\mu| > 1 - \frac{1}{K}$. Hence, the requirement for $\mu$ in the presence of outliers is more stringent than that in the case of no outliers.

However, such a difference diminishes as the number of dictionary atoms $K$ increases. Indeed, by Lemma 3.2, one can show the following lower bound for the phase boundary under the BG($p$) model:

$$\frac{1-p}{p(K-1)} \sum_{k=0}^{K-1} \mathrm{pbinom}(k, K-1, p) \sqrt{k} \geq \frac{1-p}{\sqrt{p(K-1) + 1}} \approx \frac{1}{\sqrt{s}} \left(1 - \frac{s-1}{K-1}\right)$$

for fixed sparsity level $p = \frac{s}{K}$ and large $K$.
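The shrinking gap between the two models can be seen numerically. The sketch below fixes $p = 0.2$, sets $s = pK$, and compares the lower bound $(1-p)/\sqrt{p(K-1)+1}$, the BG($p$) phase boundary and the SG($s$) phase boundary for constant inner-product dictionaries. The function names and the chosen values of $K$ are our own, and the table is an illustration rather than part of the formal argument.

```python
from math import comb, sqrt

def boundary_sg(s, K):
    """|mu| on the SG(s) phase boundary for constant inner-product dictionaries."""
    return (1 - (s - 1) / (K - 1)) / sqrt(s)

def boundary_bg(p, K):
    """|mu| on the BG(p) phase boundary for constant inner-product dictionaries."""
    mean_sqrt = sum(comb(K - 1, k) * p ** k * (1 - p) ** (K - 1 - k) * sqrt(k)
                    for k in range(K))
    return (1 - p) / (p * (K - 1)) * mean_sqrt

def bg_lower_bound(p, K):
    """Lower bound (1 - p) / sqrt(p (K - 1) + 1) on the BG(p) phase boundary."""
    return (1 - p) / sqrt(p * (K - 1) + 1)

p = 0.2
rows = []
for K in (10, 50, 250):
    s = round(p * K)                       # matched sparsity level s = p K
    rows.append((K, bg_lower_bound(p, K), boundary_bg(p, K), boundary_sg(s, K)))
for K, lower, bg, sg in rows:
    # Columns: K, lower bound, BG boundary, SG boundary, ratio SG / BG.
    print(K, round(lower, 4), round(bg, 4), round(sg, 4), round(sg / bg, 3))
```

In each row the BG($p$) boundary sits between the lower bound and the SG($s$) boundary, and the ratio in the last column moves towards one as $K$ grows.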

In general, the dual norms $|||\cdot|||_s^*$ and $|||\cdot|||_p^*$ have no closed-form expressions. According to Corollary A.2 in the Appendix, computing those quantities involves solving a second order cone problem (SOCP) with a combinatorial number of constraints. The following Lemma 3.2, on the other hand, gives computationally inexpensive approximation bounds.

Definition 3.1. (Hypergeometric distribution related quantities) Let $m$ be a positive integer and
$d, k \in \{0\} \cup [m]$. Denote by $L_m(d,k)$ the hypergeometric random variable with parameters $m$, $d$
and $k$, i.e. the number of 1's after drawing without replacement $k$ elements from $d$ 1's and $m - d$
0's. Now for each $d \in \{0\} \cup [m]$, define the function $\tau_m(d, \cdot)$ with domain $[0, m]$ as follows: set
$\tau_m(d, 0) = 0$. For $a \in (k-1, k]$ where $k \in [m]$, define

$$\tau_m(d, a) = E\sqrt{L_m(d,k-1)} + \big(E\sqrt{L_m(d,k)} - E\sqrt{L_m(d,k-1)}\big)\,(a - (k-1)).$$
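
The definition above can be evaluated directly from the hypergeometric probability mass function. The following is a minimal sketch under that reading of Definition 3.1; the helper names `hyper_pmf`, `mean_sqrt_hyper` and `tau` are ours, not the paper's.

```python
import math


def hyper_pmf(x, m, d, k):
    """P(L_m(d, k) = x): number of 1's among k draws without replacement
    from a population of d 1's and m - d 0's."""
    if x < max(0, k - (m - d)) or x > min(d, k):
        return 0.0
    return math.comb(d, x) * math.comb(m - d, k - x) / math.comb(m, k)


def mean_sqrt_hyper(m, d, k):
    """E sqrt(L_m(d, k)), computed from the definition of the expectation."""
    return sum(math.sqrt(x) * hyper_pmf(x, m, d, k) for x in range(min(d, k) + 1))


def tau(m, d, a):
    """tau_m(d, a): linear interpolation of k -> E sqrt(L_m(d, k)) on [0, m]."""
    if a == 0:
        return 0.0
    k = math.ceil(a)  # a lies in (k - 1, k]
    lo = mean_sqrt_hyper(m, d, k - 1)
    hi = mean_sqrt_hyper(m, d, k)
    return lo + (hi - lo) * (a - (k - 1))
```

Two sanity checks follow from the definition: when $d = m$ every draw is a 1, so `tau(m, m, k)` equals $\sqrt{k}$ for integer $k$; when $d = 1$, `tau(m, 1, k)` equals $k/m$.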

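Lemma 3.2. (Lower and upper bounds for $|||\cdot|||_s^*$ and $|||\cdot|||_p^*$) Let $m$ be a positive integer and $z \in \mathbb{R}^m$.

1. For $s \in [m]$,

$$\max\Big(\|z\|_\infty,\ \sqrt{\frac{s}{m}} \max_{T \subset [m]} \frac{\|z[T]\|_1}{\sqrt{|T|}}\Big) \le \max\Big(\|z\|_\infty,\ \frac{s}{m} \max_{T \subset [m]} \frac{\|z[T]\|_1}{\tau_m(|T|, s)}\Big) \le |||z|||_s^* \le \max_{S \subset [m], |S| = s} \|z[S]\|_2.$$

2. For $p \in (0,1)$,

$$\max\Big(\|z\|_\infty,\ \sqrt{p} \max_{T \subset [m]} \frac{\|z[T]\|_1}{\sqrt{|T|}}\Big) \le \max\Big(\|z\|_\infty,\ p \max_{T \subset [m]} \frac{\|z[T]\|_1}{\tau_m(|T|, pm)}\Big) \le |||z|||_p^* \le \max_{S \subset [m], |S| = k} \|z[S]\|_2,$$

where $k = \lceil p(m-1) + 1 \rceil$.

Remark:

(1) We refer readers to Lemma A.8 and A.9 for the detailed version of the above results.

(2) Since we agree that $0/0 = 0$, the case where $T = \emptyset$ does not affect taking the maximum over all
subsets.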

(3) Consider a sparse vector $z = (z, 0, \dots, 0)^T \in \mathbb{R}^m$. By Corollary A.4,

$$|||z|||_s^* = |||z|||_p^* = |z| = \|z\|_\infty = \max_{S \subset [m], |S| = s} \|z[S]\|_2.$$

So all the bounds are achievable by a sparse vector.

(4) Now consider a dense vector $z = (z, \dots, z)^T \in \mathbb{R}^m$. By Corollary A.5,

$$|||z|||_s^* = \sqrt{s}\,|z| = \sqrt{\frac{s}{m}} \max_{T \subset [m]} \frac{\|z[T]\|_1}{\sqrt{|T|}} = \max_{S \subset [m], |S| = s} \|z[S]\|_2.$$

Thus the bounds for $|||z|||_s^*$ can also be achieved by a dense vector. Similarly, by the upper bound
for $|||z|||_p^*$,

$$|||z|||_p^* \le \sqrt{pm + 1}\,|z|.$$

On the other hand,

$$|||z|||_p^* \ge \sqrt{p} \max_{T \subset [m]} \frac{\|z[T]\|_1}{\sqrt{|T|}} = \sqrt{p}\,|z| \max_{T \subset [m]} \sqrt{|T|} = \sqrt{pm}\,|z|.$$

Thus both bounds for $|||z|||_p^*$ are basically the same for large $pm$.

(5) Computation. To compute the lower and upper bounds efficiently, we first sort the elements in
$|z|$ in descending order. Without loss of generality, we can assume that $|z[1]| \ge |z[2]| \ge \dots \ge |z[m]|$.
Thus the upper-bound quantity becomes

$$\max_{S \subset [m], |S| = k} \|z[S]\|_2 = \Big(\sum_{i=1}^{k} z[i]^2\Big)^{1/2}.$$

For the lower-bound quantities, note that

$$\max_{T \subset [m]} \frac{\|z[T]\|_1}{\tau_m(|T|, k)} = \max_{d \in [m]} \max_{T \subset [m], |T| = d} \frac{\sum_{i \in T} |z[i]|}{\tau_m(d, k)} = \max_{d \in [m]} \frac{\sum_{i=1}^{d} |z[i]|}{\tau_m(d, k)}.$$

Thus, the major computational burden now is to compute $\tau_m(d,k) = E\sqrt{L_m(d,k)}$ for all $d \in [m]$.
We do not know a closed-form formula for $E\sqrt{L_m(d,k)}$ except for $d = 1$ or $d = m$. In practice,
we compute $E\sqrt{L_m(d,k)}$ using its definition formula. On an OS X laptop with 1.8 GHz Intel Core
i7 processor and 4GB of memory, the function dhyper in the statistics software R can compute
$E\sqrt{L_{2000}(d, 1000)}$ for all $d \in [2000]$ within 0.635 second. Note that the number of dictionary
atoms in most applications is usually smaller than 2000.

In case $m$ is too large, the LHS lower bounds can be used. Note that

$$\max_{T \subset [m]} \frac{\|z[T]\|_1}{\sqrt{|T|}} = \max_{d \in [m]} \frac{\sum_{i=1}^{d} |z[i]|}{\sqrt{d}},$$

which can be computed easily.
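
The sort-then-scan computation described in this remark can be sketched as follows. This is only an illustration of the two quantities that do not require $\tau_m$; the function names are ours, and the scaling factors that multiply the lower-bound quantity in Lemma 3.2 are deliberately left out.

```python
import math


def upper_quantity(z, k):
    """max over |S| = k of ||z[S]||_2: root of the sum of the k largest squares."""
    a = sorted((abs(v) for v in z), reverse=True)
    return math.sqrt(sum(v * v for v in a[:k]))


def simple_lower_quantity(z):
    """max over nonempty T of ||z[T]||_1 / sqrt(|T|), using prefix sums of
    the sorted absolute values (the best T of size d holds the d largest)."""
    a = sorted((abs(v) for v in z), reverse=True)
    best, prefix = 0.0, 0.0
    for d, v in enumerate(a, start=1):
        prefix += v
        best = max(best, prefix / math.sqrt(d))
    return best
```

Both functions cost one sort plus a linear pass, so they remain cheap even when $m$ is large.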


For notational simplicity, we will define the following quantities that are involved in Lemma 3.2.

Definition 3.2. For $a \in (0, K)$, define

$$\nu_a(M_0) = \max_{1 \le j \le K}\ \max_{S \subset [K],\, j \notin S} \frac{\|M_0[S, j]\|_1}{\tau_{K-1}(|S|, a)}.$$


Definition 3.3. (Cumulative coherence) For $k \in [K-1]$, define the $k$-th cumulative coherence of
a reference dictionary $D_0$ as

$$\mu_k(M_0) = \max_{1 \le j \le K}\ \max_{S \subset [K],\, |S| = k,\, j \notin S} \|M_0[S, j]\|_2.$$

Remark: The above quantity is actually the $\ell_2$ analog of the $\ell_1$ $k$-th cumulative coherence defined
in Gribonval et al. (2014). Also, notice that $\mu_1(M_0) = \max_{l \ne j} |M_0[l, j]|$, which is the plain mutual
coherence of the reference dictionary.


With the above definitions and as a direct consequence of Lemma 3.2, we obtain a
sufficient condition and a necessary condition for population local identifiability:

Corollary 1. Under the notations of Theorem 1, we have

1. Let $K \ge 2$ and $s \in [K-1]$.

• If $\mu_s(M_0) < 1 - \frac{s-1}{K-1}$, then $D_0$ is locally identifiable with respect to $L_{SG(s)}$.

• If $\frac{s}{K-1}\,\nu_s(M_0) > 1 - \frac{s-1}{K-1}$, then $D_0$ is not locally identifiable with respect to $L_{SG(s)}$.

2. Let $K \ge 2$ and $p \in (0,1)$.

• If $\mu_k(M_0) < 1 - p$, where $k = \lceil p(K-2) + 1 \rceil$, then $D_0$ is locally identifiable with respect to
$L_{BG(p)}$.

• If $p\,\nu_k(M_0) > 1 - p$, where $k = p(K-1)$, then $D_0$ is not locally identifiable with respect to
$L_{BG(p)}$.

Remark:

(1) In particular, by Lemma 3.2, if $\mu_1(M_0) > 1 - \frac{s-1}{K-1}$ or $\mu_1(M_0) > 1 - p$, then $D_0$ is not locally
identifiable.

(2) We can also replace $\frac{s}{K-1}\,\nu_s(M_0)$ or $p\,\nu_k(M_0)$ by the corresponding lower bound quantities in
Lemma 3.2, which are easier to compute but give weaker necessary conditions.


Comparison with GS. Corollary 1 allows us to compare our local identifiability condition directly
with that of GS. For the Bernoulli(p)-Gaussian model, the population version of the sufficient
condition for local identifiability by GS is:

$$\mu_{K-1}(M_0) = \max_{1 \le j \le K} \|M_0[-j, j]\|_2 < 1 - p. \qquad (6)$$

Note that $\mu_{K-1}(M_0) \ge \mu_k(M_0)$ for $k \le K - 1$. Thus, our local identifiability result implies
that of GS. Moreover, the quantity $\|M_0[-j, j]\|_2$ in inequality (6) computes the $\ell_2$-norm of the
entire $M_0[-j, j]$ vector and is independent of the sparsity parameter $p$. On the other hand, in our
sufficient condition, $\max_{|S| = k,\, j \notin S} \|M_0[S, j]\|_2$ computes the largest $\ell_2$-norm of all size-$k$ sub-vectors
of $M_0[-j, j]$. Since $k = \lceil p(K-2) + 1 \rceil$ is essentially $pK$, in the case where the model is sparse
and the dictionary atom collinearity matrix $M_0$ is dense, the sufficient bound by GS is more
conservative than ours.

More concretely, let us consider constant inner-product dictionaries with parameter $\mu > 0$ as in
Example 3.5. The sufficient condition by GS and the sufficient condition given by Corollary 1 are
respectively

$$\sqrt{K}\,\mu < 1 - p \quad \text{and} \quad \sqrt{pK + 1}\,\mu < 1 - p,$$

showing that the sufficient condition by GS is much more conservative for small values of $p$. See
Figure 1 for a graphical comparison of the bounds for $K = 10$.

Local identifiability for sparsity level $O(\mu^{-2})$. For notational convenience, let $\mu = \mu_1(M_0)$
be the mutual coherence of the reference dictionary. For the s-sparse model, by Lemma 3.2,
$\mu_s(M_0) \le \sqrt{s}\,\mu$. Thus the first part of the corollary implies a simpler sufficient condition:

$$\sqrt{s}\,\mu < 1 - \frac{s-1}{K-1}.$$

From the above inequality, it can be seen that if $1 - \frac{s-1}{K-1} > \delta$ for some $\delta > 0$, the reference dictionary
is locally identifiable for sparsity level $s$ up to the order $O(\mu^{-2})$.

Similarly for the Bernoulli(p)-Gaussian model, since

$$\mu_k(M_0) \le \sqrt{pK + 1}\,\mu,$$

we have the following sufficient condition for local identifiability:

$$\sqrt{pK + 1}\,\mu < 1 - p.$$

As before, if $1 - p > \delta$ for some $\delta > 0$, the reference dictionary is locally identifiable for sparsity
level $pK$ up to the order $O(\mu^{-2})$. On the other hand, the condition by GS requires $K = O(\mu^{-2})$,
which does not take advantage of sparsity.

In addition, by Example 3.5 and the remark under Lemma 3.2, we also know that the sparsity
requirement $O(\mu^{-2})$ cannot be improved in general.
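
To make the scaling concrete, the sketch below evaluates the simplified conditions discussed here: the coherence thresholds $(1-p)/\sqrt{K}$ (GS) and $(1-p)/\sqrt{pK+1}$ (Corollary 1) for the Bernoulli-Gaussian model, and the largest $s$ satisfying $\sqrt{s}\,\mu < 1 - (s-1)/(K-1)$ for the s-sparse model. The function names are ours and the example values of $\mu$ and $K$ are arbitrary.

```python
import math


def gs_threshold(p, K):
    """Largest mu allowed by the GS condition sqrt(K) * mu < 1 - p."""
    return (1 - p) / math.sqrt(K)


def cor1_threshold(p, K):
    """Largest mu allowed by the condition sqrt(p*K + 1) * mu < 1 - p."""
    return (1 - p) / math.sqrt(p * K + 1)


def max_sparsity(mu, K):
    """Largest s in [K-1] with sqrt(s) * mu < 1 - (s-1)/(K-1), or 0 if none.
    The left side increases and the right side decreases in s."""
    best = 0
    for s in range(1, K):
        if math.sqrt(s) * mu < 1 - (s - 1) / (K - 1):
            best = s
    return best


K = 10
for p in (0.1, 0.3, 0.5):
    print(p, round(gs_threshold(p, K), 4), round(cor1_threshold(p, K), 4))

# Halving mu should roughly quadruple the admissible sparsity while s << K.
print(max_sparsity(0.1, 1000), max_sparsity(0.05, 1000))
```

The gap between the two thresholds widens as $p$ shrinks, which is the sense in which the GS condition does not exploit sparsity.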

Our result seems to be the first to demonstrate that $O(\mu^{-2})$ is the optimal order of sparsity level for
exact local recovery of a reference dictionary. For a predefined over-complete dictionary, classical
results such as Donoho and Elad (2003) and Fuchs (2004) show that basis pursuit recovers an
s-sparse linear coefficient vector with sparsity level $s$ up to the order $O(\mu^{-1})$. For over-complete
dictionary learning, Geng et al. (2011) showed that exact local recovery is also possible for the s-sparse
model with $s$ up to $O(\mu^{-1})$. While our results are only for complete dictionaries, we conjecture that
$O(\mu^{-2})$ is also the optimal order of sparsity level for over-complete dictionaries. In fact, Schnass
(2015) proved that the response maximization criterion, an alternative formulation of dictionary
learning, can approximately recover the over-complete reference dictionary locally with sparsity
level $s$ up to $O(\mu^{-2})$. It will be of interest to investigate whether the same sparsity requirement holds
for the $\ell_1$-minimization dictionary learning (2) in the case of exact local recovery and over-complete
dictionaries.


4 Finite sample analysis 

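In this section, we will present finite sample results for local dictionary identifiability. For notational
convenience, we first define the following quantities:

$$P_1(\epsilon, N; \mu, K) = 2\exp\Big(-\frac{N\epsilon^2}{108 K \mu}\Big),$$

$$P_2(\epsilon, N; p, K) = 2\exp\Big(-p\,\frac{N\epsilon^2}{18 p^2 K + 9\sqrt{2pK}}\Big),$$

$$P_3(\epsilon, N; p, K) = 3\Big(\frac{24}{p\epsilon} + 1\Big)^{K} \exp\Big(-p\,\frac{N\epsilon^2}{360}\Big).$$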


Recall that $M_0 = D_0^T D_0$ and $\mu_1(M_0)$ is the mutual coherence of the reference dictionary $D_0$.
The following two theorems give local identifiability conditions under the s-sparse Gaussian model
and the Bernoulli-Gaussian model:


Theorem 2. (Finite sample local identifiability for SG(s) models) Let $\alpha_i \in \mathbb{R}^K$, $i \in [N]$, be i.i.d.
$SG(s)$ random vectors with $s \in [K-1]$. The signals $x_i$'s are generated as $x_i = D_0\alpha_i$. Assume
0 < e < 5, 


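1. If

$$\max_{1 \le j \le K} |||M_0[-j, j]|||_s^* < 1 - \frac{s-1}{K-1} - \sqrt{\frac{\pi}{2}}\,\epsilon,$$

then $D_0$ is locally identifiable with respect to $L_N(D)$ with probability exceeding

$$1 - K^2\Big(P_1(\epsilon, N; \mu_1(M_0), K) + P_2(\epsilon, N; \tfrac{s}{K}, K) + P_3(\epsilon, N; \tfrac{s}{K}, K)\Big).$$

2. If

$$\max_{1 \le j \le K} |||M_0[-j, j]|||_s^* > 1 - \frac{s-1}{K-1} + \sqrt{\frac{\pi}{2}}\,\epsilon,$$

then $D_0$ is not locally identifiable with respect to $L_N(D)$ with probability exceeding

$$1 - K\Big(P_1(\epsilon, N; \mu_1(M_0), K) + P_2(\epsilon, N; \tfrac{s}{K}, K) + P_3(\epsilon, N; \tfrac{s}{K}, K)\Big).$$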


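Theorem 3. (Finite sample local identifiability for BG(p) models) Let $\alpha_i \in \mathbb{R}^K$, $i \in [N]$, be
i.i.d. $BG(p)$ random vectors with $p \in (0,1)$. The signals $x_i$'s are generated as $x_i = D_0\alpha_i$. Let
$K_p = K + 2p^{-1}$ and assume 0 < $\epsilon$ <

1. If

$$\max_{1 \le j \le K} |||M_0[-j, j]|||_p^* < 1 - p - \sqrt{\frac{\pi}{2}}\,\epsilon,$$

then $D_0$ is locally identifiable with respect to $L_N(D)$ with probability exceeding

$$1 - K^2\Big(P_1(\epsilon, N; \mu_1(M_0), K_p) + P_2(\epsilon, N; p, K_p) + P_3(\epsilon, N; p, K)\Big).$$

2. If

$$\max_{1 \le j \le K} |||M_0[-j, j]|||_p^* > 1 - p + \sqrt{\frac{\pi}{2}}\,\epsilon,$$

then $D_0$ is not locally identifiable with respect to $L_N(D)$ with probability exceeding

$$1 - K\Big(P_1(\epsilon, N; \mu_1(M_0), K_p) + P_2(\epsilon, N; p, K_p) + P_3(\epsilon, N; p, K)\Big).$$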


The conditions for finite sample local identifiability are essentially identical to their population
counterparts. The only difference is a margin of $\sqrt{\pi/2}\,\epsilon$ on the RHS of the inequalities. Such a
margin appears in the conditions because of our proof techniques: we show that the derivative of
$L_N$ is within $O(\epsilon)$ of its expectation and then impose conditions on the expectation.





Sample size requirement. The theorems indicate that if the number of signals is a multiple of
the following quantity,

$$\text{For } SG(s):\quad \max\Big\{\mu_1(M_0)\,K\log K,\ \ s\log K,\ \ \frac{K}{s}\,K\log\Big(\frac{K}{\epsilon s}\Big)\Big\}$$

$$\text{For } BG(p):\quad \max\Big\{\mu_1(M_0)\,K\log K,\ \ pK\log K,\ \ \frac{1}{p}\,K\log\Big(\frac{1}{\epsilon p}\Big)\Big\}$$

then with high probability we can determine whether or not $D_0$ is locally identifiable. For ease
of analysis, let us now treat $\epsilon$ as a constant. Thus, in the worst case, the sample size requirements
for the two models are, respectively,

$$O\Big(\frac{K\log K}{s/K}\Big) \quad \text{and} \quad O\Big(\frac{K\log K}{p}\Big).$$

Apart from playing a role in determining whether $D_0$ is locally identifiable, the sparsity parameters
$s$ and $p$ also affect the sample size requirement. As discussed in the population results, the sparser
the linear coefficient $\alpha_i$, the weaker the constraint on the dictionary atom collinearity. However, with finite
samples, more signals are needed to guarantee the validity of the local identifiability conditions for
sparse models.

Our sample size requirement is similar to that of GS, who show that $O\big(\frac{K\log K}{p(1-p)}\big)$ signals are
enough for locally recovering an incoherent reference dictionary. Our result indicates that the $1 - p$
factor in the denominator can be removed.


The following two corollaries are the finite sample counterparts of Corollary 1.

Corollary 2. Under the same assumptions of Theorem 2,

1. (Sufficient condition for SG(s) models) If

$$\mu_s(M_0) < 1 - \frac{s-1}{K-1} - \sqrt{\frac{\pi}{2}}\,\epsilon,$$

then $D_0$ is locally identifiable with respect to $L_N(D)$, with the same probability bound as in the first
part of Theorem 2.

2. (Necessary condition for SG(s) models) If

$$\frac{s}{K-1}\,\nu_s(M_0) > 1 - \frac{s-1}{K-1} + \sqrt{\frac{\pi}{2}}\,\epsilon,$$

then $D_0$ is not locally identifiable with respect to $L_N(D)$, with the same probability bound as in the
second part of Theorem 2.

Corollary 3. Under the same assumptions of Theorem 3,

1. (Sufficient condition for BG(p) models) Let $k = \lceil p(K-1) + 1 \rceil$. If

$$\mu_k(M_0) < 1 - p - \sqrt{\frac{\pi}{2}}\,\epsilon,$$

then $D_0$ is locally identifiable with respect to $L_N(D)$, with the same probability bound as in the first
part of Theorem 3.

2. (Necessary condition for BG(p) models) Let $k = p(K-1)$. If

$$p\,\nu_k(M_0) > 1 - p + \sqrt{\frac{\pi}{2}}\,\epsilon,$$

then $D_0$ is not locally identifiable with respect to $L_N(D)$, with the same probability bound as in the
second part of Theorem 3.

Remark: As before, denote by $\mu \in [0,1)$ the coherence of the reference dictionary. The above
two corollaries together with the remark under Corollary 1 indicate that the reference dictionary
is locally identifiable with high probability for sparsity level $s$ or $pK$ up to the order $O(\mu^{-2})$.


Proof sketch for Theorems 2 and 3. Similar to the population case, by taking one-sided
derivatives of $L_N(D_t)$ with respect to $t$ at $t = 0$ for all smooth $\{D_t\}_{t \in \mathbb{R}}$, we derive a sufficient
and almost necessary algebraic condition for the reference dictionary $D_0$ to be a local minimum
of $L_N(D)$. Using the concentration inequalities in Lemma A.1 - A.3, we show that the random
quantities involved in the algebraic condition are close to their expectations with high probability.
The population results for local identifiability can then be applied. The proofs for the two signal
generation models are conceptually the same after establishing Lemma A.6 to relate the two norms
$|||\cdot|||_s$ and $|||\cdot|||_p$. The detailed proof can be found in Section A.2.


Comparison with the proof by GS. The key difference between our analysis and that of GS
is that we use an alternative but equivalent formulation of dictionary learning. Instead of (2), GS
studied the following problem:

$$\min_{D \in \mathcal{D},\, \alpha_i}\ \frac{1}{N}\sum_{i=1}^{N} \|\alpha_i\|_1 \quad \text{subject to } x_i = D\alpha_i \text{ for all } i \in [N]. \qquad (7)$$

Note that the above formulation optimizes jointly over $D$ and $\alpha_i$ for $i \in [N]$, as opposed to
optimizing with respect to the only parameter $D$ in our case. For complete dictionaries, this
formulation is equivalent to the formulation in (2) in the sense that $D$ is a local minimum of (2)
if and only if $(D, D^{-1}[x_1, \dots, x_N])$ is a local minimum of (7); see Remark 3.1 of GS. The number
of parameters to be estimated in (7) is $(K-1)K + KN$, compared to $(K-1)K$ free parameters
in (2). The growing number of parameters makes the formulation employed by GS less tractable to
analyze under a signal generation model.

GS did not study the population case. In their analysis, GS first obtained an algebraic condition 
for local identifiability that is sufficient and almost necessary. However, their condition is convoluted 
due to its direct dependence on the signals $x_i$'s. In order to make their condition more explicit in
terms of dictionary atom collinearity and sparsity level, they then investigated the condition under 
the Bernoulli-Gaussian model. During the probabilistic analysis, the sharp algebraic condition was 
weakened, resulting in a sufficient condition that is far from being necessary. 

In contrast, we start with probabilistic generative models. The number of parameters does not
grow as $N$ increases, which allows us to study the population problem directly and to apply
concentration inequalities for the finite sample problem. There is little loss of information during the
process of obtaining identifiability results from first principles. Therefore, studying the optimization
problem (2) instead of (7) is the key to establishing an interpretable sufficient and almost necessary
local identifiability condition.

5 Conclusions and future work


We have established sufficient and almost necessary conditions for local dictionary identifiability
under both the s-sparse model and the Bernoulli-Gaussian model in the case of noiseless signals
and complete dictionaries. For finite samples with a fixed sparsity level, we have shown that as long
as the number of i.i.d. signals scales as $O(K \log K)$, with high probability we can determine whether
or not a reference dictionary is locally identifiable by checking the identifiability conditions.

There are several directions for future research. First of all, here we only study the local
behaviors of the $\ell_1$-norm objective function. As pointed out by GS, numerical experiments in two
dimensions suggest that local minima are in fact global minima, see Figure 2 of GS. Thus, it is of
interest to investigate whether the conditions developed in this paper for local identifiability are
also sufficient and almost necessary for global identifiability.

Moreover, one can extend our results to a wider class of sub-Gaussian distributions other than
the standard Gaussian distribution considered in this paper. We foresee few technical difficulties
for this extension. However, it should be noted that the quantities involved in our local identifiability
conditions, i.e. the $|||\cdot|||_s^*$ and $|||\cdot|||_p^*$ norms, are consequences of the standard Gaussian assumption.
Under a different distribution, it can be even more challenging to compute and approximate those
quantities.

Finally, it would also be desirable to improve the sufficient conditions by Geng et al. (2011) and
Gribonval et al. (2014) for over-complete dictionaries and noisy signals. One of the implications
of our identifiability condition is that local recovery is possible for sparsity level up to the order
$O(\mu^{-2})$ for a $\mu$-coherent reference dictionary. We conjecture the same sparsity requirement holds
for the over-complete and/or the noisy signal case. In either case, the closed-form expression for the
objective function is no longer available. A full characterization of local dictionary identifiability
requires us to develop new techniques to analyze the local behaviors of the objective function.

A Appendix


Let $L(D)$ be a function of $D \in \mathcal{D}$ and $\{D_t\}_{t \in \mathbb{R}}$ be the collection of dictionaries $D_t \in \mathcal{D}$ parameterized
by $t \in \mathbb{R}$. By definition, $\{D_t\}_{t \in \mathbb{R}}$ passes through the reference dictionary $D_0$ at $t = 0$. To ensure
that $D_0$ is a local minimum of $L(D)$, it suffices to have

$$\lim_{t \downarrow 0^+} \frac{L(D_t) - L(D_0)}{t} > 0 \quad \text{and} \quad \lim_{t \uparrow 0^-} \frac{L(D_t) - L(D_0)}{t} < 0,$$

for all $\{D_t\}_{t \in \mathbb{R}}$ that is a smooth function of $t$. On the other hand, if either of the above strict
inequalities holds in the reverse direction for some smooth $\{D_t\}_{t \in \mathbb{R}}$, then $D_0$ is not a local minimum
of $L(D)$.

Since $D_0$ is full rank by assumption, the minimum eigenvalue of $M_0 = D_0^T D_0$ is strictly greater
than zero. By continuity of the minimum eigenvalue of $D_t^T D_t$ (see e.g. the Bauer-Fike Theorem), when
$D_t$ and $D_0$ are sufficiently close, $D_t$ should also be full rank. Thus without loss of generality we
only need to work with full rank dictionaries $D_t$. For any full rank $D \in \mathcal{D}$, there is a full rank matrix
$A \in \mathbb{R}^{K \times K}$ such that $D = D_0 A$. For any $k \in [K]$, by the constraint $\|D[,k]\|_2 = 1$, the matrix $A$
should satisfy $A[,k]^T M_0 A[,k] = 1$. Define the set of all such $A$'s as:

$$\mathcal{A} = \{A \in \mathbb{R}^{K \times K} : A \text{ is invertible and } A[,k]^T M_0 A[,k] = 1 \text{ for all } k \in [K]\}. \qquad (8)$$

It follows immediately that the set $\{D_0 A : A \in \mathcal{A}\}$ is the collection of $D \in \mathcal{D}$ such that $D$ is full
rank. Thus, to ensure that $D_0$ is a local minimum of $L(D)$, it suffices to show

$$\Delta^+(L, \{A_t\}_t) := \lim_{t \downarrow 0^+} \frac{L(D_0 A_t) - L(D_0)}{t} > 0, \qquad (9)$$

$$\Delta^-(L, \{A_t\}_t) := \lim_{t \uparrow 0^-} \frac{L(D_0 A_t) - L(D_0)}{t} < 0, \qquad (10)$$

for all smooth functions $\{A_t\}_{t \in \mathbb{R}}$ with $A_t \in \mathcal{A}$ and $A_0 = I$. In addition, to demonstrate that $D_0$ is
not a local minimum of $L(D)$, it suffices to have (9) or (10) hold in the reverse direction for some
$\{A_t\}_t$ with the aforementioned properties. We will be using this characterization of local minimum
to prove local identifiability results for both the population case and the finite sample case.


A.1 Proofs of the population results
A.1.1 Proof of Lemma 3.1

Proof. Since $E\|H\alpha_1\|_1 = \sum_{j=1}^{K} E|H[j,]\alpha_1|$, it suffices to compute $E|H[j,]\alpha_1|$. Let $S$ be any
nonempty subset of $[K]$. Recall that the random variable $S_1 \subset [K]$ denotes the support of the
random coefficient $\alpha_1$. Conditioning on the event $\{S_1 = S\}$, the random variable $H[j,]\alpha_1$ follows a
normal distribution with mean 0 and standard deviation $\|H[j,S]\|_2$. Hence

$$E|H[j,]\alpha_1| = E\big[E[\,|H[j,]\alpha_1|\ \big|\ S_1]\big] = \sqrt{\frac{2}{\pi}}\,E\|H[j,S_1]\|_2.$$

(1) Under the s-sparse Gaussian model, $P(S_1 = S) = \binom{K}{s}^{-1}$ for any $|S| = s$. Thus we have

$$E\|H[j,S_1]\|_2 = \binom{K}{s}^{-1} \sum_{S: |S| = s} \|H[j,S]\|_2 = \frac{s}{K}\,|||H[j,]|||_s.$$

Hence the objective function for the s-sparse Gaussian model is

$$L_{SG(s)}(D) = \sqrt{\frac{2}{\pi}}\,\frac{s}{K} \sum_{j=1}^{K} |||H[j,]|||_s.$$

In particular, for $s = K$, $|||H[j,]|||_K = \|H[j,]\|_2$ and so

$$L_{SG(K)}(D) = \sqrt{\frac{2}{\pi}} \sum_{j=1}^{K} \|H[j,]\|_2.$$

(2) Under the Bernoulli(p)-Gaussian model, $P(S_1 = S) = p^{|S|}(1-p)^{K-|S|}$. So we have

$$E\big[\|H[j,S_1]\|_2\big] = \sum_{k=1}^{K} \sum_{S: |S| = k} p^k (1-p)^{K-k}\,\|H[j,S]\|_2 = p \sum_{k=0}^{K-1} \mathrm{pbinom}(k; K-1, p)\,|||H[j,]|||_{k+1}.$$

Therefore for $p \in (0,1)$, the objective function under the Bernoulli-Gaussian model is

$$L_{BG(p)}(D) = \sum_{j=1}^{K} E|H[j,]\alpha_1| = \sqrt{\frac{2}{\pi}}\,p \sum_{j=1}^{K} |||H[j,]|||_p.$$

Finally, if $p = 1$, we have

$$L_{BG(p)}(D) = \sqrt{\frac{2}{\pi}} \sum_{j=1}^{K} \|H[j,]\|_2.$$
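
The key identity in this proof, $E|h^T\alpha| = \sqrt{2/\pi}\,E\|h[S_1]\|_2$ for a fixed row $h$, can be checked numerically for the s-sparse Gaussian model. The sketch below is only an illustration: the function names are ours, and the coefficient model is assumed to be a uniformly random size-$s$ support with i.i.d. standard Gaussian entries on it.

```python
import itertools
import math
import random


def exact_mean_abs(h, s):
    """sqrt(2/pi) * average of ||h[S]||_2 over all supports S with |S| = s."""
    K = len(h)
    total = sum(
        math.sqrt(sum(h[i] ** 2 for i in S))
        for S in itertools.combinations(range(K), s)
    )
    return math.sqrt(2 / math.pi) * total / math.comb(K, s)


def monte_carlo_mean_abs(h, s, n_draws, seed=0):
    """Monte Carlo estimate of E|h^T alpha| under the assumed SG(s) model."""
    rng = random.Random(seed)
    K = len(h)
    acc = 0.0
    for _ in range(n_draws):
        support = rng.sample(range(K), s)  # uniformly random support
        acc += abs(sum(h[i] * rng.gauss(0.0, 1.0) for i in support))
    return acc / n_draws


h = [1.0, -2.0, 0.5, 3.0]
print(exact_mean_abs(h, 2), monte_carlo_mean_abs(h, 2, 100000))
```

The two printed numbers should agree up to Monte Carlo error; with $s = K$ the exact value reduces to $\sqrt{2/\pi}\,\|h\|_2$, matching the non-sparse case above.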


A.1.2 Proof of Theorem 1

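Proof. (1) Let us first consider the s-sparse Gaussian model. By (9) and (10), to ensure that $D_0$ is
a local minimum of $L_{SG(s)}(D)$, it suffices to show

$$\Delta^+(L_{SG(s)}, \{A_t\}_t) > 0 \quad \text{and} \quad \Delta^-(L_{SG(s)}, \{A_t\}_t) < 0, \qquad (11)$$

for all smooth functions $\{A_t\}_t$ with $A_t \in \mathcal{A}$, where $\mathcal{A}$ is defined in (8), and $A_0 = I$. Note that by
Lemma 3.1,

$$\Delta^+(L_{SG(s)}, \{A_t\}_t) = \sqrt{\frac{2}{\pi}}\,\frac{s}{K} \sum_{j=1}^{K} \lim_{t \downarrow 0^+} \frac{1}{t}\Big(|||A_t^{-1}[j,]|||_s - |||I[j,]|||_s\Big). \qquad (12)$$

For a fixed $j \in [K]$, we have

$$\binom{K-1}{s-1}\,|||A_t^{-1}[j,]|||_s = \sum_{S: |S| = s,\, j \in S} \|A_t^{-1}[j,S]\|_2 + \sum_{S: |S| = s,\, j \notin S} \|A_t^{-1}[j,S]\|_2. \qquad (13)$$

Denote by $\dot{A}_0 \in \mathbb{R}^{K \times K}$ the derivative of $\{A_t\}_t$ at $t = 0$. Since $A_t \in \mathcal{A}$ for all $t \in \mathbb{R}$, it can be
shown that

$$M_0[,k]^T \dot{A}_0[,k] = 0 \quad \text{for all } k \in [K]. \qquad (14)$$

By (14), we have

$$\dot{A}_0[j,j] = -\sum_{l \ne j} M_0[l,j]\,\dot{A}_0[l,j] \quad \text{for all } j \in [K]. \qquad (15)$$

Now notice that

$$\frac{dA_t^{-1}}{dt}\Big|_{t=0} = -A_0^{-1}\dot{A}_0 A_0^{-1} = -\dot{A}_0. \qquad (16)$$

Combining the above equality with Lemma A.12 and A.13, we have

$$\lim_{t \downarrow 0^+} \frac{1}{t}\Big(\|A_t^{-1}[j,S]\|_2 - \|I[j,S]\|_2\Big) = \begin{cases} -\dot{A}_0[j,j] & \text{if } j \in S, \\ \|\dot{A}_0[j,S]\|_2 & \text{if } j \notin S. \end{cases}$$

Therefore

$$\lim_{t \downarrow 0^+} \frac{1}{t}\Big(|||A_t^{-1}[j,]|||_s - |||I[j,]|||_s\Big) = -\dot{A}_0[j,j] + \binom{K-1}{s-1}^{-1} \sum_{S: |S| = s,\, j \notin S} \|\dot{A}_0[j,S]\|_2. \qquad (17)$$

Combining (12), (13), (15) and (17), we have

$$\sqrt{\frac{\pi}{2}}\,\frac{K}{s}\,\Delta^+(L_{SG(s)}, \{A_t\}_t) = \sum_{j=1}^{K}\Big(-\dot{A}_0[j,j] + \binom{K-1}{s-1}^{-1} \sum_{S: |S| = s,\, j \notin S} \|\dot{A}_0[j,S]\|_2\Big)$$

$$= \sum_{j=1}^{K}\Big(\sum_{l \ne j} M_0[l,j]\,\dot{A}_0[l,j] + \binom{K-1}{s-1}^{-1} \sum_{S: |S| = s,\, j \notin S} \|\dot{A}_0[j,S]\|_2\Big)$$

$$= \sum_{j=1}^{K}\Big(\sum_{l \ne j} M_0[l,j]\,\dot{A}_0[j,l] + \binom{K-1}{s-1}^{-1} \sum_{S: |S| = s,\, j \notin S} \|\dot{A}_0[j,S]\|_2\Big).$$

Similarly, one can show

$$\sqrt{\frac{\pi}{2}}\,\frac{K}{s}\,\Delta^-(L_{SG(s)}, \{A_t\}_t) = \sum_{j=1}^{K}\Big(\sum_{l \ne j} M_0[l,j]\,\dot{A}_0[j,l] - \binom{K-1}{s-1}^{-1} \sum_{S: |S| = s,\, j \notin S} \|\dot{A}_0[j,S]\|_2\Big).$$

Thus for $s \in [K-1]$, to establish (11) it suffices to require for each $j \in [K]$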

$$\Big|\sum_{l \ne j} M_0[l,j]\,\dot{A}_0[j,l]\Big| < \frac{K-s}{K-1}\binom{K-2}{s-1}^{-1} \sum_{S: |S| = s,\, j \notin S} \|\dot{A}_0[j,S]\|_2 = \frac{K-s}{K-1}\,|||\dot{A}_0[j,-j]|||_s \qquad (18)$$

for any $\dot{A}_0$ such that $\dot{A}_0[j,-j] \ne 0$. Since $\dot{A}_0[j,l]$ is a free variable for $l \ne j$, (18) is equivalent to

$$\big|M_0[-j,j]^T w\big| < \frac{K-s}{K-1},$$

for all $w \in \mathbb{R}^{K-1}$ such that $|||w|||_s = 1$. Thus by the definition of the dual norm, it suffices to have

$$|||M_0[-j,j]|||_s^* = \sup_{|||w|||_s = 1} M_0[-j,j]^T w < \frac{K-s}{K-1}.$$

Therefore, the condition

$$\max_{1 \le j \le K} |||M_0[-j,j]|||_s^* < \frac{K-s}{K-1} = 1 - \frac{s-1}{K-1} \qquad (19)$$

is sufficient for $D_0$ to be locally identifiable with respect to the objective function $L_{SG(s)}$.
Similarly, one can check that if the reversed strict inequality in (19) holds, $D_0$ is not a local
minimum of $L_{SG(s)}(D)$. Thus we complete the proof for the s-sparse model.

(2) Now consider the Bernoulli(p)-Gaussian model for $p \in (0,1)$. First of all, note that we have

$$\sqrt{\frac{\pi}{2}}\,\frac{1}{p}\,\Delta^{\pm}(L_{BG(p)}, \{A_t\}_t) = \sum_{j=1}^{K} \lim_{t \to 0^{\pm}} \frac{1}{t}\Big(|||A_t^{-1}[j,]|||_p - |||I[j,]|||_p\Big)$$

$$= \sum_{j=1}^{K}\Big(-\dot{A}_0[j,j] \pm (1-p) \sum_{k=0}^{K-2} p^k (1-p)^{K-2-k} \sum_{S: |S| = k+1,\, j \notin S} \|\dot{A}_0[j,S]\|_2\Big)$$

$$= \sum_{j=1}^{K}\Big(-\dot{A}_0[j,j] \pm (1-p) \sum_{k=0}^{K-2} \mathrm{pbinom}(k; K-2, p)\,|||\dot{A}_0[j,-j]|||_{k+1}\Big)$$

$$= \sum_{j=1}^{K}\Big(\dot{A}_0[j,-j]^T M_0[-j,j] \pm (1-p)\,|||\dot{A}_0[j,-j]|||_p\Big).$$

Thus, similar to the s-sparse Gaussian case, it can be shown that a sufficient condition for local
identifiability is

$$\big|M_0[-j,j]^T w\big| < 1 - p,$$

for all $j \in [K]$ and all $w \in \mathbb{R}^{K-1}$ such that $|||w|||_p = 1$. The above condition is equivalent to

$$\max_{1 \le j \le K} |||M_0[-j,j]|||_p^* < 1 - p.$$

The rest of the proof proceeds as in the case of the s-sparse Gaussian model.

(3) Now let us consider the non-sparse case where $s = K$ or $p = 1$. In this case, since the objective
functions are the same under both models (see Theorem 1), we only need to consider the s-sparse
Gaussian model. If $s = K$, the RHS quantity in Inequality (18) is zero. Thus, the reference
dictionary is not locally identifiable if

$$\big|M_0[-j,j]^T w\big| > 0,$$

for some $j \in [K]$ and $w \in \mathbb{R}^{K-1}$. Thus, if $M_0$ is not the identity matrix, or equivalently, if the
reference dictionary $D_0$ is not orthogonal, $D_0$ is not locally identifiable.

Next, let us deal with the case where $D_0$ is orthogonal. Let $D \in \mathcal{D}$ be a full rank dictionary
and $W = D^{-1}$. Since $D_0$ is orthogonal, $\|W[j,]D_0\|_2 = \|W[j,]\|_2$. By the fact that $WD = I$ and
$\|D[,j]\|_2 = 1$, we have $1 = W[j,]D[,j] \le \|W[j,]\|_2\,\|D[,j]\|_2 = \|W[j,]\|_2$, where the equality holds
iff $W[j,]^T = \pm D[,j]$.

Under the K-sparse Gaussian model,

$$L_{SG(K)}(D) = \sqrt{\frac{2}{\pi}} \sum_{j=1}^{K} \|W[j,]D_0\|_2 \ge \sqrt{\frac{2}{\pi}}\,K = L_{SG(K)}(D_0),$$

where the equality holds for any $D$ such that $D^T D = I$. Thus, $L_{SG(K)}(D_0) = L_{SG(K)}(D_0 U)$ for
any orthogonal matrix $U \in \mathbb{R}^{K \times K}$, i.e. the objective function remains the same as we rotate $D_0$.
Therefore, $D_0$ is not a local minimum of $L_{SG(K)}$.

In conclusion, $D_0$ is not locally identifiable when $s = K$ or $p = 1$.

□


A.2 Proofs of the finite sample results: Theorem 2 and Theorem 3

Proof. We will first recall the signal generation procedure in Section 2. Let $z$ be a $K$-dimensional
standard Gaussian vector, and $\xi \in \{0,1\}^K$ be either an s-sparse random vector or a Bernoulli
random vector with probability $p$. Let $z_1, \dots, z_N$ and $\xi_1, \dots, \xi_N$ be independent and identically distributed copies
of $z$ and $\xi$ respectively. For each $i \in [N]$ and $j \in [K]$, define $\alpha_i[j] = z_i[j]\,\xi_i[j]$. For $S \subset [K]$ with
$1 \le |S| \le K - 1$, define

$$\chi_i(S) = \begin{cases} 1 & \text{if } \xi_i[k] = 1 \text{ for all } k \in S \text{ and } \xi_i[k] = 0 \text{ for all } k \in S^c, \\ 0 & \text{otherwise.} \end{cases}$$

On the other hand, if $S = [K]$, define $\chi_i(S) = 1$ if $\xi_i[k] = 1$ for all $k \in [K]$ and $\chi_i(S) = 0$ otherwise.
As in the population case, in the following analysis we will work with full rank dictionaries $D$. First
of all, notice that

$$l(D, x_i) = \|D^{-1}x_i\|_1 = \|D^{-1}D_0\alpha_i\|_1 = \sum_{j=1}^{K} \big|A^{-1}[j,]\alpha_i\big| = \sum_{j=1}^{K} \sum_{k=1}^{K} \sum_{S: |S| = k} \big|A^{-1}[j,S]\,z_i[S]\big|\,\chi_i(S).$$

Next, we have

$$\Delta^+(l(\cdot, x_i), \{A_t\}_t) = \lim_{t \downarrow 0^+} \frac{1}{t}\big(l(D_0A_t, x_i) - l(D_0, x_i)\big)$$

$$= \sum_{j=1}^{K}\Big(-\sum_{k=1}^{K} \sum_{S: j \in S,\, |S| = k} \dot{A}_0[j,j]\,|z_i[j]|\,\chi_i(S) - \mathrm{sgn}(z_i[j]) \sum_{k=2}^{K} \sum_{S: j \in S,\, |S| = k}\ \sum_{l \in S,\, l \ne j} \dot{A}_0[j,l]\,z_i[l]\,\chi_i(S) + \sum_{k=1}^{K-1} \sum_{S: j \notin S,\, |S| = k} \big|\dot{A}_0[j,S]\,z_i[S]\big|\,\chi_i(S)\Big). \qquad (20)$$

Here $\mathrm{sgn}(x)$ is the sign function of $x \in \mathbb{R}$ such that $\mathrm{sgn}(x) = 1$ for $x > 0$, $\mathrm{sgn}(x) = -1$ for $x < 0$
and $\mathrm{sgn}(x) = 0$ for $x = 0$. By (15), the first term in (20) can be rearranged as follows

$$-\sum_{j=1}^{K} |z_i[j]| \sum_{k=1}^{K} \sum_{S: j \in S,\, |S| = k} \dot{A}_0[j,j]\,\chi_i(S) = \sum_{j=1}^{K} |z_i[j]| \sum_{k=1}^{K} \sum_{S: j \in S,\, |S| = k}\ \sum_{l \ne j} M_0[l,j]\,\dot{A}_0[l,j]\,\chi_i(S)$$

$$= \sum_{j=1}^{K} \sum_{l \ne j} M_0[j,l]\,\dot{A}_0[j,l]\,|z_i[l]| \sum_{k=1}^{K} \sum_{S: l \in S,\, |S| = k} \chi_i(S).$$

The second term in (20) can be rewritten as

$$-\sum_{j=1}^{K} \mathrm{sgn}(z_i[j]) \sum_{l \ne j} \dot{A}_0[j,l]\,z_i[l] \sum_{k=2}^{K} \sum_{S: \{j,l\} \subset S,\, |S| = k} \chi_i(S).$$

For $j, l \in [K]$ such that $j \ne l$, define the following quantities

$$F_i[l,j] = M_0[j,l]\,|z_i[l]| \sum_{k=1}^{K} \sum_{S: l \in S,\, |S| = k} \chi_i(S), \qquad (21)$$

$$G_i[l,j] = \mathrm{sgn}(z_i[j])\,z_i[l] \sum_{k=2}^{K} \sum_{S: \{j,l\} \subset S,\, |S| = k} \chi_i(S), \qquad (22)$$

whereas $F_i[j,j] = G_i[j,j] = 0$. For each $j \in [K]$, also define

$$t_i[j](w) = \sum_{k=1}^{K-1} \sum_{S: j \notin S,\, |S| = k} \big|w[S]^T z_i[S]\big|\,\chi_i(S). \qquad (23)$$

Let $F$, $G$ and $t$ be the sample averages of $F_i$, $G_i$ and $t_i$ respectively. With the definitions (21) -
(23), we have

$$\Delta^+(L_N, \{A_t\}_t) = \frac{1}{N} \sum_{i=1}^{N} \Delta^+(l(\cdot, x_i), \{A_t\}_t)$$

$$= \sum_{j=1}^{K} \frac{1}{N} \sum_{i=1}^{N} \Big(\dot{A}_0[j,]F_i[,j] - \dot{A}_0[j,]G_i[,j] + t_i[j](\dot{A}_0[j,])\Big)$$

$$= \sum_{j=1}^{K} \Big(\dot{A}_0[j,]F[,j] - \dot{A}_0[j,]G[,j] + t[j](\dot{A}_0[j,])\Big).$$

On the other hand,

$$\Delta^-(L_N, \{A_t\}_t) = \sum_{j=1}^{K} \Big(\dot{A}_0[j,]F[,j] - \dot{A}_0[j,]G[,j] - t[j](\dot{A}_0[j,])\Big).$$

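Now for $j \in [K]$, $s \in [K-1]$ and $p \in (0,1)$, define

$$\mathcal{E}_j(s) = \{w \in \mathbb{R}^K : |||w[-j]|||_s = 1,\ w[j] = 0\},$$

$$\mathcal{F}_j(p) = \{w \in \mathbb{R}^K : |||w[-j]|||_p = 1,\ w[j] = 0\}.$$

Thus to ensure that $D_0$ is a local minimum, it suffices to have for each $j \in [K]$,

$$H_j(w) := \big|w^T F[,j] - w^T G[,j]\big| - t[j](w) < 0,$$

for all $w \in \mathcal{E}_j(s)$ for the s-sparse Gaussian model or all $w \in \mathcal{F}_j(p)$ for the Bernoulli(p)-Gaussian
model.

(1) For the s-sparse Gaussian model, let $j \in [K]$ and define

$$h_j(w) = \sqrt{\frac{2}{\pi}}\,\frac{s}{K}\Big(\big|w^T M_0[,j]\big| - \frac{K-s}{K-1}\Big),$$

which can be thought of as the expected value of $H_j(w)$. Note that by the triangle inequality,

$$\sup_{w \in \mathcal{E}_j(s)} |H_j(w) - h_j(w)| \le \sup_{w \in \mathcal{E}_j(s)} \Big|w^T\Big(F[,j] - \sqrt{\frac{2}{\pi}}\,\frac{s}{K}\,M_0[,j]\Big)\Big| + \sup_{w \in \mathcal{E}_j(s)} \big|w^T G[,j]\big| + \sup_{w \in \mathcal{E}_j(s)} \Big|t[j](w) - \sqrt{\frac{2}{\pi}}\,\frac{s}{K}\,\frac{K-s}{K-1}\Big|$$

$$\le \Big|\Big|\Big|F[-j,j] - \sqrt{\frac{2}{\pi}}\,\frac{s}{K}\,M_0[-j,j]\Big|\Big|\Big|_s^* + |||G[-j,j]|||_s^* + \sup_{w \in \mathcal{E}_j(s)} \Big|t[j](w) - \sqrt{\frac{2}{\pi}}\,\frac{s}{K}\,\frac{K-s}{K-1}\Big|. \qquad (24)$$

Thus, $\sup_{w \in \mathcal{E}_j(s)} |H_j(w) - h_j(w)| > \frac{s}{K}\epsilon$ implies at least one of the three terms on the RHS is greater
than $\frac{s}{3K}\epsilon$. Using a union bound and by Lemma A.1 - A.3, we have

$$P\Big\{\sup_{w \in \mathcal{E}_j(s)} |H_j(w) - h_j(w)| > \frac{s}{K}\epsilon\Big\} \le 2K\exp\Big(-\frac{N\epsilon^2}{108K\,\|M_0[-j,j]\|_\infty}\Big) + 2K\exp\Big(-\frac{s}{K}\,\frac{N\epsilon^2}{18(s/K)s + 9\sqrt{2s}}\Big) + 3\Big(\frac{24K}{\epsilon s} + 1\Big)^K \exp\Big(-\frac{s}{K}\,\frac{N\epsilon^2}{360}\Big). \qquad (25)$$

It is easy to see that the event $\{\sup_{w \in \mathcal{E}_j(s)} |H_j(w) - h_j(w)| \le \frac{s}{K}\epsilon\}$ implies

$$\sup_{w \in \mathcal{E}_j(s)} h_j(w) - \frac{s}{K}\epsilon \le \sup_{w \in \mathcal{E}_j(s)} H_j(w) \le \sup_{w \in \mathcal{E}_j(s)} h_j(w) + \frac{s}{K}\epsilon. \qquad (26)$$

On the other hand,

$$\sup_{w \in \mathcal{E}_j(s)} h_j(w) = \sqrt{\frac{2}{\pi}}\,\frac{s}{K}\Big(|||M_0[-j,j]|||_s^* - \frac{K-s}{K-1}\Big).$$

Thus, if $|||M_0[-j,j]|||_s^* < \frac{K-s}{K-1} - \sqrt{\frac{\pi}{2}}\,\epsilon$, then $\sup_{w \in \mathcal{E}_j(s)} H_j(w) < 0$ except with probability at most the
bound in (25). To ensure $D_0$ to be a local minimum, it suffices to have $\sup_{w \in \mathcal{E}_j(s)} H_j(w) < 0$ for
all $j \in [K]$. Thus, if $|||M_0[-j,j]|||_s^* < \frac{K-s}{K-1} - \sqrt{\frac{\pi}{2}}\,\epsilon$ for all $j \in [K]$, we have

$$P\{D_0 \text{ is locally identifiable}\} \ge P\Big\{\max_j \sup_{w \in \mathcal{E}_j(s)} H_j(w) < 0\Big\} \ge 1 - P\Big\{\max_j \sup_{w \in \mathcal{E}_j(s)} H_j(w) \ge 0\Big\}$$

$$\ge 1 - \sum_{j=1}^{K} P\Big\{\sup_{w \in \mathcal{E}_j(s)} H_j(w) \ge 0\Big\} \ge 1 - \sum_{j=1}^{K} P\Big\{\sup_{w \in \mathcal{E}_j(s)} |H_j(w) - h_j(w)| > \frac{s}{K}\epsilon\Big\}$$

$$\ge 1 - 2K^2\exp\Big(-\frac{N\epsilon^2}{108K\max_{l \ne j}|M_0[l,j]|}\Big) - 2K^2\exp\Big(-\frac{s}{K}\,\frac{N\epsilon^2}{18(s/K)s + 9\sqrt{2s}}\Big) - 3K\Big(\frac{24K}{\epsilon s} + 1\Big)^K \exp\Big(-\frac{s}{K}\,\frac{N\epsilon^2}{360}\Big).$$

On the other hand, to ensure $D_0$ is not locally identifiable with high probability, it suffices to have
$|||M_0[-j,j]|||_s^* > \frac{K-s}{K-1} + \sqrt{\frac{\pi}{2}}\,\epsilon$ for some $j \in [K]$. Indeed, under that condition, the LHS inequality
in (26) implies $\sup_{w \in \mathcal{E}_j(s)} H_j(w) > 0$. Therefore

$$P\{D_0 \text{ is not locally identifiable}\} \ge P\Big\{\sup_{w \in \mathcal{E}_j(s)} H_j(w) > 0\Big\} \ge 1 - P\Big\{\sup_{w \in \mathcal{E}_j(s)} H_j(w) \le 0\Big\}$$

$$\ge 1 - P\Big\{\sup_{w \in \mathcal{E}_j(s)} |H_j(w) - h_j(w)| > \frac{s}{K}\epsilon\Big\}$$

$$\ge 1 - 2K\exp\Big(-\frac{N\epsilon^2}{108K\,\|M_0[-j,j]\|_\infty}\Big) - 2K\exp\Big(-\frac{s}{K}\,\frac{N\epsilon^2}{18(s/K)s + 9\sqrt{2s}}\Big) - 3\Big(\frac{24K}{\epsilon s} + 1\Big)^K \exp\Big(-\frac{s}{K}\,\frac{N\epsilon^2}{360}\Big).$$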

(2) For the Bernoulli(p)-Gaussian model, define

\[
u_j(w) = \sqrt{\tfrac{2}{\pi}}\,p\,\big( |w^T M_0[-j,j]| - (1-p) \big).
\]

Similar to (24), by the triangle inequality,

\[
\sup_{w\in\mathcal{F}_j(p)} |H_j(w) - u_j(w)|
\le \Big|\Big|\Big| \bar F[-j,j] - \sqrt{\tfrac{2}{\pi}}\,p\,M_0[-j,j] \Big|\Big|\Big|_p^*
+ |||\bar G[-j,j]|||_p^*
+ \sup_{w\in\mathcal{F}_j(p)} \Big| \bar t[j](w) - \sqrt{\tfrac{2}{\pi}}\,p(1-p) \Big|.
\]

Then the analysis can be carried out in a similar manner using the parallel version of the concentration inequalities, i.e. Part 2 of Lemmas A.1-A.3.

□


A.3 Concentration inequalities 

We will make frequent use of the following version of Bernstein's inequality. A proof of the inequality can be found in, e.g., Chapter 14 of Bühlmann and van de Geer (2011).


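Theorem A.1. (Bernstein's inequality) Let $Y_1, \ldots, Y_N$ be independent random variables that satisfy the moment condition

\[
E|Y_i|^m \le \frac{1}{2} \times V \times m! \times B^{m-2},
\]

for integers $m \ge 2$. Then

\[
P\Big\{ \Big| \frac{1}{N}\sum_{i=1}^{N} (Y_i - EY_i) \Big| > \epsilon \Big\} \le 2\exp\Big( -\frac{N\epsilon^2}{2V + 2B\epsilon} \Big).
\]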
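To get a feel for how the bound in Theorem A.1 scales, the following minimal sketch evaluates the right-hand side and inverts it for the number of samples. The function names and the numerical values of $V$, $B$, $\epsilon$ and the target level are illustrative assumptions, not quantities taken from the paper.

```python
import math

def bernstein_tail(n, eps, v, b):
    """Right-hand side of Theorem A.1: 2 * exp(-n * eps^2 / (2v + 2b * eps))."""
    return 2.0 * math.exp(-n * eps ** 2 / (2.0 * v + 2.0 * b * eps))

def samples_needed(delta, eps, v, b):
    """Smallest n for which the Bernstein bound is at most delta."""
    return math.ceil((2.0 * v + 2.0 * b * eps) * math.log(2.0 / delta) / eps ** 2)

# Illustrative (assumed) constants: V = 1, B = sqrt(2), deviation 0.1, level 1%.
n = samples_needed(delta=0.01, eps=0.1, v=1.0, b=math.sqrt(2))
print(n, bernstein_tail(n, 0.1, 1.0, math.sqrt(2)))
```

Solving the bound for $N$ in this way is what produces the sample-size requirements in the lemmas below: the denominator $2V + 2B\epsilon$ is the only place where the moment constants enter.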


Lemma A.l. (Uniform concentration of F\—j, j]) For i E [[IV]], let F, E W KxR be defined as in 
(2$ andF = (l/N)El 1 F l . 

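1. Under the s-sparse Gaussian model with $s \in [K-1]$,

\[
P\Big\{ \Big|\Big|\Big| \bar F[-j,j] - \sqrt{\tfrac{2}{\pi}}\,\frac{s}{K}\,M_0[-j,j] \Big|\Big|\Big|_s^* > \frac{s}{K}\epsilon \Big\}
\le 2K\exp\Big( -\frac{N\epsilon^2}{12K\,\|M_0[-j,j]\|_\infty} \Big)
\]

for $0 < \epsilon \le 1$.

2. Under the Bernoulli-Gaussian model with parameter $p \in (0,1)$,

\[
P\Big\{ \Big|\Big|\Big| \bar F[-j,j] - \sqrt{\tfrac{2}{\pi}}\,p\,M_0[-j,j] \Big|\Big|\Big|_p^* > p\epsilon \Big\}
\le 2K\exp\Big( -\frac{N\epsilon^2}{12(K + 2p^{-1})\,\|M_0[-j,j]\|_\infty} \Big)
\]

for $0 < \epsilon \le 1$.

In particular, if $\|M_0[-j,j]\|_\infty = 0$, then the RHS bound is trivially zero.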

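Proof. (1) First of all, we will prove the inequality for the s-sparse model. Notice that by Lemma 3.2, we have

\[
\Big|\Big|\Big| \bar F[-j,j] - \sqrt{\tfrac{2}{\pi}}\,\frac{s}{K}\,M_0[-j,j] \Big|\Big|\Big|_s^*
\le \max_{|S|=s,\, j\notin S} \Big\| \bar F[S,j] - \sqrt{\tfrac{2}{\pi}}\,\frac{s}{K}\,M_0[S,j] \Big\|_2
\le \sqrt{s}\,\max_{l\ne j} \Big| \bar F[l,j] - \sqrt{\tfrac{2}{\pi}}\,\frac{s}{K}\,M_0[l,j] \Big|.
\]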
tyj V vr A 
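For convenience, define

\[
v_i[l] = |z_i[l]| \sum_{k=1}^{K} \sum_{|S|=k,\, l\in S} \chi_i(S)
\]

for $i \in [N]$ and $l \in [K]$. Note that $\sum_{k=1}^{K}\sum_{l\in S,\,|S|=k} \chi_i(S) = 1$ with probability $s/K$. Thus

\[
E\Big( \sum_{k=1}^{K} \sum_{|S|=k,\, l\in S} \chi_i(S) \Big)^m = \frac{s}{K}.
\]

For $m \ge 1$, by Jensen's inequality $\big|\frac{a+b}{2}\big|^m \le \frac{1}{2}(|a|^m + |b|^m)$ and $E|Z|^m \ge (E|Z|)^m = (2/\pi)^{m/2}$, where $Z$ is a standard Gaussian variable. In addition, $E|Z|^m \le (m-1)!! \le 2^{-m/2}\, m!$. Hence

\[
\begin{aligned}
E\big| v_i[l] - E v_i[l] \big|^m
&\le 2^{m-1}\Big( E|z_i[l]|^m\,\frac{s}{K} + \Big(\frac{2}{\pi}\Big)^{m/2}\Big(\frac{s}{K}\Big)^m \Big) \\
&\le 2 \times E|Z|^m \times 2^{m-1}\,\frac{s}{K} \\
&\le 2 \times 2^{-m/2}\, m! \times 2^{m-1}\,\frac{s}{K} \\
&= \frac{1}{2} \times \frac{4s}{K} \times m! \times (\sqrt{2})^{m-2}.
\end{aligned}
\]

Thus by Bernstein's inequality, we have

\[
P\Big\{ \Big| \frac{1}{N}\sum_{i=1}^{N} v_i[l] - \sqrt{\tfrac{2}{\pi}}\,\frac{s}{K} \Big| > \epsilon \Big\}
\le 2\exp\Big( -\frac{N\epsilon^2}{2(4s/K + \sqrt{2}\,\epsilon)} \Big).
\]

Therefore,

\[
\begin{aligned}
P\Big\{ |M_0[j,l]| \cdot \Big| \sqrt{\tfrac{2}{\pi}}\,\frac{s}{K} - \frac{1}{N}\sum_{i=1}^{N} v_i[l] \Big| > \frac{s}{K}\epsilon \Big\}
&\le 2\exp\Big( -\frac{s}{K}\,\frac{N\epsilon^2}{2(4M_0[j,l]^2 + \sqrt{2}\,|M_0[j,l]|\,\epsilon)} \Big) \\
&\le 2\exp\Big( -\frac{s}{K}\,\frac{N\epsilon^2}{2|M_0[j,l]|\,(4 + \sqrt{2}\,\epsilon)} \Big) \\
&\le 2\exp\Big( -\frac{s}{K}\,\frac{N\epsilon^2}{12\,|M_0[j,l]|} \Big)
\end{aligned}
\]

for $\epsilon \le 1$. Notice that if $M_0[j,l] = 0$ the LHS probability is trivially zero. Using a union bound, we have

\[
\begin{aligned}
P\Big\{ \Big\| \bar F[-j,j] - \sqrt{\tfrac{2}{\pi}}\,\frac{s}{K}\,M_0[-j,j] \Big\|_\infty > \frac{s}{K}\epsilon \Big\}
&= P\Big\{ \max_{l\ne j} |M_0[j,l]| \cdot \Big| \sqrt{\tfrac{2}{\pi}}\,\frac{s}{K} - \frac{1}{N}\sum_{i=1}^{N} v_i[l] \Big| > \frac{s}{K}\epsilon \Big\} \\
&\le 2K\exp\Big( -\frac{s}{K}\,\frac{N\epsilon^2}{12\,\|M_0[-j,j]\|_\infty} \Big).
\end{aligned}
\]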



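Therefore

\[
\begin{aligned}
P\Big\{ \Big|\Big|\Big| \bar F[-j,j] - \sqrt{\tfrac{2}{\pi}}\,\frac{s}{K}\,M_0[-j,j] \Big|\Big|\Big|_s^* > \frac{s}{K}\epsilon \Big\}
&\le P\Big\{ \sqrt{s}\,\Big\| \bar F[-j,j] - \sqrt{\tfrac{2}{\pi}}\,\frac{s}{K}\,M_0[-j,j] \Big\|_\infty > \frac{s}{K}\epsilon \Big\} \\
&\le 2K\exp\Big( -\frac{N\epsilon^2}{12K\,\|M_0[-j,j]\|_\infty} \Big).
\end{aligned}
\]
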

(2) Now let us consider the Bernoulli-Gaussian model. Notice that by Lemma 
we have 


< 


A.6 


for 


K- 


F [~j,j] - \l -pM 0 [-j,j] 


F ~ \l -pMo[~j,j] 


< Vs 


F [~j,j] ~ \ -pM 0 [-j,j] 


Now let s = \pK — p + 1] < pK + 2. For i G [IV] and l G [iV], define 

K 


Ui[l] = \zi[l]\J2 Xi{S) 

k =1 |S|=s,ZeS 


p. 


Note that the event j^fcLi X)|S|=fc les Xi{S) = l| is the same as the event that [Z] = 1}, 
happens with probability p. Thus 


K 


e IE E *(S)| =p. 

ik =1 \S\=k,leS 


Similar to the case of s-sparse model, 

E |uj[7]| m < - x 4 p x m\ x ^\/2^ 

By Bernstein’s inequality, we have 

N 


m —2 


vX>i'i 


1=1 


> e > < 2 exp — 


JVe 2 


2(4p + \/2e) 


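Therefore

\[
\begin{aligned}
P\Big\{ \Big|\Big|\Big| \bar F[-j,j] - \sqrt{\tfrac{2}{\pi}}\,p\,M_0[-j,j] \Big|\Big|\Big|_p^* > p\epsilon \Big\}
&\le P\Big\{ \sqrt{s}\,\Big\| \bar F[-j,j] - \sqrt{\tfrac{2}{\pi}}\,p\,M_0[-j,j] \Big\|_\infty > p\epsilon \Big\} \\
&\le 2K\exp\Big( -\frac{p}{s}\,\frac{N\epsilon^2}{2\,\|M_0[-j,j]\|_\infty\,(4 + \sqrt{2}\,\epsilon)} \Big) \\
&\le 2K\exp\Big( -\frac{N\epsilon^2}{12(K + 2p^{-1})\,\|M_0[-j,j]\|_\infty} \Big)
\end{aligned}
\]

for $\epsilon \le 1$.

□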
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be defined as in 


Lemma A. 2. (Uniform concentration of G[—j, j]) For i E [[IV]], let G j E 


»KxK 


(2$ and G = (1/A) ^G,. 

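1. Under the s-sparse Gaussian model with $s \in [K-1]$,

\[
P\Big\{ |||\bar G[-j,j]|||_s^* > \frac{s}{K}\epsilon \Big\}
\le 2K\exp\Big( -\frac{s}{K}\,\frac{N\epsilon^2}{2(s/K)s + \sqrt{2s}} \Big),
\]

for $0 < \epsilon \le 1$.

2. Under the Bernoulli-Gaussian model with parameter $p \in (0,1)$,

\[
P\Big\{ |||\bar G[-j,j]|||_p^* > p\epsilon \Big\}
\le 2K\exp\Big( -p\,\frac{N\epsilon^2}{p(pK+2) + \sqrt{2(pK+2)}} \Big),
\]

for $0 < \epsilon \le 1$.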


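Proof. The proof is highly similar to that of Lemma A.1 and so we will omit some common steps.
(1) We first prove the concentration inequality for the s-sparse model. Notice that

\[
|||\bar G[-j,j]|||_s^* \le \sqrt{s}\,\max_{l\ne j} |\bar G[l,j]|.
\]

In addition,

\[
E\Big( \sum_{k=2}^{K} \sum_{\{j,l\}\subset S,\,|S|=k} \chi_i(S) \Big)^m
= E \sum_{k=2}^{K} \sum_{|S|=k,\,\{j,l\}\subset S} \chi_i(S)
= \binom{K}{s}^{-1}\binom{K-2}{s-2}
= \frac{s(s-1)}{K(K-1)} \le \Big(\frac{s}{K}\Big)^2.
\]

Thus

\[
E|G_i[l,j]|^m \le 2^{-m/2}\, m! \times \Big(\frac{s}{K}\Big)^2
= \frac{1}{2} \times \Big(\frac{s}{K}\Big)^2 \times m! \times \Big(\frac{1}{\sqrt{2}}\Big)^{m-2}.
\]

By Bernstein's inequality:

\[
P\Big\{ \Big| \frac{1}{N}\sum_{i=1}^{N} G_i[l,j] \Big| > \epsilon \Big\}
\le 2\exp\Big( -\frac{N\epsilon^2}{2(s/K)^2 + \sqrt{2}\,\epsilon} \Big).
\]

Thus we have

\[
\begin{aligned}
P\Big\{ |||\bar G[-j,j]|||_s^* > \frac{s}{K}\epsilon \Big\}
&\le P\Big\{ \sqrt{s}\,\max_{l\ne j} |\bar G[l,j]| > \frac{s}{K}\epsilon \Big\} \\
&\le 2K\exp\Big( -\frac{(s/K)^2\, N\,(\epsilon^2/s)}{2(s/K)^2 + \sqrt{2}\,(s/K)(\epsilon/\sqrt{s})} \Big) \\
&\le 2K\exp\Big( -\frac{s}{K}\,\frac{N\epsilon^2}{2(s/K)s + \sqrt{2s}\,\epsilon} \Big) \\
&\le 2K\exp\Big( -\frac{s}{K}\,\frac{N\epsilon^2}{2(s/K)s + \sqrt{2s}} \Big)
\end{aligned}
\]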
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for e < 1. 

(2) For Bernoulli-Gaussian model, notice that 


\\\G[-j,j]\\\l < |||G[—j,j]HI* < v / smax|G[/,j]|, 


¥1 


for s = \pK — p + 1] < pK + 2. Also, 


Thus 


E|G < 2- m / 2 rn! xp 2 = -xp 2 xm!x (4=) m_2 . 

2 \/2 


|G[-j,j]|||* >P e } < P|^m|Dc|G[Z, j]| >pe 

jV(e 2 ) 

< 2Aexp(—p- -=) 


2 ps + y/2s 


< 2Ji exp —p 


M 2 


p(pA + 2) + yj2(pK + 2) 


for e < 1. 


□ 


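Lemma A.3. (Uniform concentration of $\bar t[j](w)$) For $i \in [N]$, let $t_i$ be a function from $\mathbb{R}^K$ to $\mathbb{R}^K$ defined as in (23) and $\bar t = (1/N)\sum_{i=1}^{N} t_i$. Recall that for $j \in [K]$, $s \in [K-1]$ and $p \in (0,1)$,

\[
\mathcal{E}_j(s) = \{ w \in \mathbb{R}^K : |||w[-j]|||_s = 1,\ w[j] = 0 \},
\qquad
\mathcal{F}_j(p) = \{ w \in \mathbb{R}^K : |||w[-j]|||_p = 1,\ w[j] = 0 \}.
\]

1. Under the s-sparse Gaussian model with $s \in [K-1]$,

\[
P\Big\{ \sup_{w\in\mathcal{E}_j(s)} \Big| \bar t[j](w) - \sqrt{\tfrac{2}{\pi}}\,\frac{s}{K}\,\frac{K-s}{K-1} \Big| > \frac{s}{K}\epsilon \Big\}
\le 3\Big(\frac{8K}{\epsilon s}+1\Big)^{K}\exp\Big( -\frac{s}{K}\,\frac{N\epsilon^2}{40} \Big)
\]

for $0 < \epsilon \le \frac{1}{2}$.

2. Under the Bernoulli-Gaussian model with parameter $p \in (0,1)$,

\[
P\Big\{ \sup_{w\in\mathcal{F}_j(p)} \Big| \bar t[j](w) - \sqrt{\tfrac{2}{\pi}}\,p(1-p) \Big| > p\epsilon \Big\}
\le 3\Big(\frac{8}{\epsilon p}+1\Big)^{K}\exp\Big( -p\,\frac{N\epsilon^2}{40} \Big)
\]

for $0 < \epsilon \le \frac{1}{2}$.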

Proof. (1) Under the s-sparse model, we have 


ep 


K 


Ne 2 

lo" 


E|t,[j](w)| m = E Y |w[S] T z 4 [S]| Xi (S) 
\l S\=sj#S 

= Y E|w[5] T z45]| m E Xi (5) 

\S\=s,j?S 
-1 


A 


Y E|w[5] T Zi[ 


\S\=sj#S 
Notice that we have used the facts that the events $\chi_i(S)$ are mutually exclusive and that $z_i[S]$ and $\chi_i(S)$ are independent. Since the random variable $w[S]^T z_i[S]$ has distribution $N(0, \|w[S]\|_2^2)$, $E|w[S]^T z_i[S]|^m = \|w[S]\|_2^m\, E|Z|^m \le 2^{-m/2}\, m!\, \|w[S]\|_2^m$. Therefore


E|t,b-](w)r < 2-f ml 


I< 


-l 




m 
2 • 


j£S,\S\=s 


Note that by Lemma A.5, $|||w[-j]|||_s \ge \|w[-j]\|_2 \ge \|w[S]\|_2$ for all $S$ such that $j \notin S$. For $w \in \mathcal{E}_j(s)$, $|||w[-j]|||_s = 1$ and so $\|w[S]\|_2 \le 1$, which further implies that $\|w[S]\|_2^m \le \|w[S]\|_2$. Thus we have


E|tj \j\ (w) | m <2 2 m! 


K 


-1 


X] W w i S ]W- 


j$S,\S\=s 

0 _ 2 ii . s(K — s ) ... r 

<2 2 m! ——— -- w — 7 


= 2 2 ml 



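For a fixed $j$, define

\[
U_i(w) = t_i[j](w) - \sqrt{\tfrac{2}{\pi}}\,\frac{s}{K}\,\frac{K-s}{K-1}.
\]

Notice that $E U_i(w) = 0$. In addition,

\[
E|U_i(w)|^m \le 2^m\, E|t_i[j](w)|^m \le \frac{1}{2} \times 4\,\frac{s}{K}\,\frac{K-s}{K-1} \times m! \times (\sqrt{2})^{m-2}.
\]

By Bernstein's inequality

\[
P\Big\{ \Big| \frac{1}{N}\sum_{i=1}^{N} U_i(w) \Big| > \frac{s}{K}\epsilon \Big\}
\le 2\exp\Big( -\frac{s}{K}\,\frac{N\epsilon^2}{2\big(4\frac{K-s}{K-1} + \sqrt{2}\,\epsilon\big)} \Big)
\le 2\exp\Big( -\frac{s}{K}\,\frac{N\epsilon^2}{10} \Big)
\]
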


for 0 < e < 1/2. Now let {wj} be an d-cover of £j(s). Since £j(s) is contained in the unit ball 


{w E 1 : 11w11 2 < 1}, there exists a cover such that |{w^}| < (| + l) A 1 . For any w,w' E £j(s ), 
we have 

\Ui(-w)-Ui(w')\ < 22 |( w [' s '] - w/ [S]) Tz i[51| Xi(S). 

j$S,\S\=s 

Let Z be a standard Gaussian variable. We have 


22 | w [<S'] T Zj[-S , ]| Xi(£) > e 

\S\=8jtS 


= 


K - 1 
s 


-1 


22 p {l w [' s ] Tz *[ 5 ]| > e ) 

\S\=s,j<£S 


K — 1 
s 


-1 


E n*{iiw[s]Nzi>a 

I S\=s,j?S 

<P{||w|| 2 |Z| >e}. 
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Let Zj, i = 1, be i.i.d standard Gaussian variables. By the one-sided Bernstein’s inequality, 


■£>| > 2 } < exp (- < exp 

h i \ 2(4 + V2(2-v^A)J V Sj 


Now let <5 = ^|. Thus 


1 A " 

sup _?7 *( w/ ))i 


|w-w'|| 2 <<5 i=1 


sup -V 

K 2 J | ||w'-w|| 2<5 j =1 


1 A ' 

sup N7 H “ ^( w ')| > 


f 1 W 

£P liiwT4< s ivg“ w - w ' l ' 2 ' Zil> lrl 


AT 7^ 1^*1 > 2 


< expf-^. 


By triangle inequality 


N N N 

sup li£ c/ ’ i ( w, )l- su p 4X^( w ) - ^(w'))! + 


||w'-w||2<<5 

Using a union bound, we have 


||w' —w|| 2 <<$ N i = l 


P{ sup - 

||w'-w|| 2 <<5 


1 N ( 

>r)£ p 

i 1 L 


Sup AT ~ U i( w ')) > 

w— w'|| 2 <5 i=1 


/ iV\ /a Ne 2 \ 

<exp(-g +2ex p l--—I 


( s JVe 2 \ 

- 3exp “Kl(T ’ 


for 0 < e < 1. Now apply union bound again, 


P < sup — | ) 
\w e£ j (s) N t 


yx>(w)i>y £ }<p{ 


Sr max sup —1> 

\ 1 iiw-wiii. 2 <<5 -W i=1 

(8K \ K ( s Ne 2 

<A — + i) «p(- F1 „- 


AT 'I 

5Z^( w )i > 44 

i=i j 
(2) For $w \in \mathcal{F}_j(p)$ under the Bernoulli-Gaussian model:

\[
\begin{aligned}
E|t_i[j](w)|^m
&= E|Z|^m \sum_{k=1}^{K-1} \sum_{|S|=k,\, j\notin S} \|w[S]\|_2^m \times p^k (1-p)^{K-k} \\
&\le E|Z|^m \sum_{k=1}^{K-1} \sum_{|S|=k,\, j\notin S} \|w[S]\|_2 \times p^k (1-p)^{K-k} \\
&= E|Z|^m\, p(1-p) \sum_{k=0}^{K-2} \sum_{|S|=k+1,\, j\notin S} \|w[S]\|_2 \times p^k (1-p)^{K-2-k} \\
&= E|Z|^m\, p(1-p)\, |||w[-j]|||_p = E|Z|^m\, p(1-p) \\
&\le 2^{-m/2}\, m!\, p(1-p).
\end{aligned}
\]

Notice that we have used the fact that $\|w[S]\|_2 \le \|w[-j]\|_2 \le |||w[-j]|||_p = 1$ for all $S$ such that $j \notin S$. For each fixed $w$, define

\[
V_i(w) = t_i[j](w) - \sqrt{\tfrac{2}{\pi}}\,(1-p)p.
\]

Now we have

\[
E|V_i(w)|^m \le 2^m\, E|t_i[j](w)|^m \le \frac{1}{2} \times 4p(1-p) \times m! \times (\sqrt{2})^{m-2}.
\]

The remaining parts of the proof proceed exactly as in the case of the s-sparse model, noticing that we only need to replace $\frac{s}{K}$ by $p$, and $\frac{K-s}{K-1}$ by $1-p$.

□


A.4 Dual analysis of $|||\cdot|||_s$ and $|||\cdot|||_p$

In this section, we will characterize the dual norms $|||\cdot|||_s^*$ and $|||\cdot|||_p^*$ by second order cone programs (SOCP). The characterization is helpful for deriving bounds for these special norms in the next section.


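Lemma A.4. For $i \in [M]$, let $A_i$ be a $k_i \times K$ matrix with rank $k_i$. For $z \in \mathbb{R}^K$, define

\[
\|z\|_A = \sum_{i=1}^{M} \|A_i z\|_2.
\]

Then the dual norm of $\|\cdot\|_A$ is

\[
\|v\|_A^* = \inf\Big\{ \max_i \|y_i\|_2 \;:\; y_i \in \mathbb{R}^{k_i},\ \sum_{i=1}^{M} A_i^T y_i = v \Big\}.
\]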

Proof.

\[
\|v\|_A^* = \sup_{z \ne 0} \frac{v^T z}{\|z\|_A} = \sup\{ v^T z : \|z\|_A \le 1 \}.
\]
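Introducing a Lagrange multiplier $\lambda \ge 0$ for the inequality constraint, the above problem is equivalent to the following:

\[
\sup_{z}\ \inf_{\lambda \ge 0}\ \big\{ v^T z + \lambda\,(1 - \|z\|_A) \big\}.
\]

The dual problem is

\[
d = \inf_{\lambda \ge 0}\ \sup_{z}\ \big\{ v^T z + \lambda\,(1 - \|z\|_A) \big\}.
\]
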
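Notice that $\|A_i z\|_2 = \sup\{ z^T A_i^T u_i : \|u_i\|_2 \le 1 \}$. Hence

\[
d = \inf_{\lambda \ge 0} \Big\{ \lambda + \sup_{z,\,u_i} \Big\{ z^T\Big( v - \lambda \sum_{i=1}^{M} A_i^T u_i \Big) : \|u_i\|_2 \le 1 \Big\} \Big\}.
\]

Since the vector $z$ can be arbitrary, in order to have a finite value, we must have $\lambda \sum_{i=1}^{M} A_i^T u_i = v$. Now let $y_i = \lambda u_i$; the problem becomes

\[
d = \inf\Big\{ \lambda \;:\; \sum_{i=1}^{M} A_i^T y_i = v,\ \|y_i\|_2 \le \lambda \Big\}.
\]

The above problem is exactly equivalent to

\[
\inf\Big\{ \max_i \|y_i\|_2 \;:\; y_i \in \mathbb{R}^{k_i},\ \sum_{i=1}^{M} A_i^T y_i = v \Big\}.
\]

Finally, notice that the original problem is convex and strictly feasible. Thus Slater's condition holds and the duality gap is zero. Hence

\[
\|v\|_A^* = \inf\Big\{ \max_i \|y_i\|_2 \;:\; y_i \in \mathbb{R}^{k_i},\ \sum_{i=1}^{M} A_i^T y_i = v \Big\}.
\]

□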
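As a quick sanity check of Lemma A.4, the formula recovers two familiar dual pairs. The two choices of $A_i$ below are illustrative special cases chosen here, not examples from the paper.

```latex
% Case 1: M = K and A_i = e_i^T (the i-th coordinate row), so k_i = 1.
\|z\|_A = \sum_{i=1}^{K} |z_i| = \|z\|_1, \qquad
\sum_{i=1}^{K} e_i\, y_i = v \;\Longrightarrow\; y_i = v_i, \qquad
\|v\|_A^* = \max_i |v_i| = \|v\|_\infty .

% Case 2: M = 1 and A_1 = I_K, so k_1 = K.
\|z\|_A = \|z\|_2, \qquad
I_K^T y_1 = v \;\Longrightarrow\; y_1 = v, \qquad
\|v\|_A^* = \|v\|_2 .
```

In both cases the equality constraint leaves a single feasible point, so the infimum is trivial; for overlapping index sets, as in the norms studied next, the feasible set is larger and the infimum becomes a genuine optimization problem.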


The following corollary gives an alternative characterization of $|||\cdot|||_s^*$ and $|||\cdot|||_p^*$:

Corollary A.1. Denote by $y_S \in \mathbb{R}^{|S|}$ a variable vector indexed by the set $S$ (as opposed to being a subvector of $y$). For $z \in \mathbb{R}^m$, we have

\[
|||z|||_s^* = \inf\Big\{ \max_{|S|=s} \|y_S\|_2 \;:\; y_S \in \mathbb{R}^{s},\ \sum_{|S|=s} E_S^T y_S = z \Big\},
\]

and

\[
|||z|||_p^* = \inf\Big\{ \max_{S} \|y_S\|_2 \;:\; y_S \in \mathbb{R}^{|S|},\ \sum_{k=0}^{m-1} \mathrm{pbinom}(k; m-1, p) \sum_{|S|=k+1} E_S^T y_S = z \Big\},
\]

where $E_S = I[S,\,]/\binom{m-1}{|S|-1}$ and $I \in \mathbb{R}^{m\times m}$ is the identity matrix.

Proof. This is simply a direct application of Lemma A.4.

□


Corollary A.2. The dual norms $|||\cdot|||_s^*$ and $|||\cdot|||_p^*$ can be computed via a Second Order Cone Program (SOCP).

Proof. Introducing an additional variable $t \ge 0$, the problem of computing $|||z|||_s^*$ is equivalent to the following formulation:

\[
\inf_{t,\,y_S}\ t \quad \text{s.t.} \quad \|y_S\|_2 \le t \ \text{ for all } S \text{ such that } |S| = s, \quad \text{and} \quad \sum_{|S|=s} E_S^T y_S = z.
\]

Notice that the above program is already in the standard form of SOCP. The case of $|||\cdot|||_p^*$ can be handled in a similar manner.

□


A.5 Inequalities of $|||\cdot|||_s$ and $|||\cdot|||_p$ and their duals

As demonstrated in the last section, it is in general expensive to compute $|||\cdot|||_s^*$ and $|||\cdot|||_p^*$. In this section, we will derive sharp and easy-to-compute lower and upper bounds to approximate these quantities.


Lemma A.5. (Monotonicity of $|||z|||_s$ and $|||z|||_p$) Let $z \in \mathbb{R}^m$. $|||z|||_1 = \|z\|_1$ and $|||z|||_m = \|z\|_2$. For $1 \le l < k \le m$, we have $|||z|||_l \ge |||z|||_k$; similarly for $0 < p < q < 1$, $|||z|||_p \ge |||z|||_q$. Furthermore, the equalities hold iff the vector $z$ contains at most one non-zero entry.

Proof. By definition, we have

\[
|||z|||_1 = \frac{\sum_{|S|=1} \|z[S]\|_2}{\binom{m-1}{0}} = \|z\|_1.
\]

Similarly,

\[
|||z|||_m = \frac{\sum_{|S|=m} \|z[S]\|_2}{\binom{m-1}{m-1}} = \|z\|_2.
\]

For $1 \le k \le m-1$, let $S'$ be a subset of $[m]$ such that $|S'| = k+1$. By the triangle inequality

\[
\sum_{|S|=k,\, S\subset S'} \|z[S]\|_2 \ge k\,\|z[S']\|_2,
\]

where the equality holds iff $\|z[S']\|_0 \le 1$. Thus

\[
\sum_{|S'|=k+1}\ \sum_{|S|=k,\, S\subset S'} \|z[S]\|_2 \ge k \sum_{|S'|=k+1} \|z[S']\|_2,
\]

and the equality holds iff $\|z\|_0 \le 1$. Notice that the LHS of the above inequality is simply $(m-k)\sum_{|S|=k} \|z[S]\|_2$. Therefore

\[
|||z|||_k = \binom{m-1}{k-1}^{-1} \sum_{|S|=k} \|z[S]\|_2 \ \ge\ \binom{m-1}{k}^{-1} \sum_{|S|=k+1} \|z[S]\|_2 = |||z|||_{k+1},
\]

and so the inequality holds.

For $|||\cdot|||_p$, let $Y$ be a random variable that follows the binomial distribution with parameters $m-1$ and $p$. Observe that $|||z|||_p = E|||z|||_{Y+1}$, where the expectation is taken with respect to $Y$. If $\|z\|_0 > 1$, $|||z|||_k$ is strictly decreasing in $k$ by the first part. Hence, $|||z|||_p$ as a function of $p$ is also strictly decreasing on $(0,1)$. Indeed, it can be shown that


\[
\frac{d}{dp}\,|||z|||_p = (m-1)\sum_{k=0}^{m-2} \mathrm{pbinom}(k; m-2, p)\,\big( |||z|||_{k+2} - |||z|||_{k+1} \big) < 0.
\]

If $\|z\|_0 \le 1$, then $|||z|||_1 = |||z|||_m$ and so $\frac{d}{dp}|||z|||_p = 0$. Therefore $|||z|||_p$ is a constant in $p$. On the other hand, if $|||z|||_p = |||z|||_q$ for $0 < p < q < 1$, by the fact that $\frac{d}{dp}|||z|||_p \le 0$, we must have $\frac{d}{dp}|||z|||_p = 0$ and so $|||z|||_{k-1} = |||z|||_k$ for all $k \in \{2, \ldots, m\}$. Thus $\|z\|_0 \le 1$.

□
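The monotonicity in Lemma A.5 can be checked numerically on small vectors by brute-force enumeration of subsets. The sketch below uses the definitions appearing in the proof above, $|||z|||_s = \binom{m-1}{s-1}^{-1}\sum_{|S|=s}\|z[S]\|_2$ and $|||z|||_p = E|||z|||_{Y+1}$ with $Y \sim \mathrm{Binomial}(m-1, p)$; the function names and the test vector are illustrative choices, and the enumeration is exponential in $m$, so it is only meant for tiny examples.

```python
from itertools import combinations
from math import comb, sqrt

def norm_s(z, s):
    """|||z|||_s: average of ||z[S]||_2 over |S| = s, normalized by C(m-1, s-1)."""
    m = len(z)
    total = sum(sqrt(sum(z[i] ** 2 for i in S)) for S in combinations(range(m), s))
    return total / comb(m - 1, s - 1)

def norm_p(z, p):
    """|||z|||_p = E |||z|||_{Y+1} with Y ~ Binomial(m-1, p)."""
    m = len(z)
    return sum(comb(m - 1, k) * p ** k * (1 - p) ** (m - 1 - k) * norm_s(z, k + 1)
               for k in range(m))

z = [3.0, -1.0, 2.0, 0.5]
by_size = [norm_s(z, s) for s in range(1, len(z) + 1)]   # s = 1, ..., m
by_prob = [norm_p(z, p) for p in (0.1, 0.5, 0.9)]
print(by_size)
print(by_prob)
```

The first and last entries of `by_size` are the $\ell_1$ and $\ell_2$ norms of `z`, and both lists decrease, as the lemma states. For a vector with a single non-zero entry every value coincides.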


Corollary A.3. (Monotonicity of $|||z|||_s^*$ and $|||z|||_p^*$) Let $z \in \mathbb{R}^m$. $|||z|||_1^* = \|z\|_\infty$ and $|||z|||_m^* = \|z\|_2$. For $1 \le i < j \le m$, we have $|||z|||_i^* \le |||z|||_j^*$; similarly for $0 < p < q < 1$, $|||z|||_p^* \le |||z|||_q^*$. Furthermore, the equalities hold iff the vector $z$ contains at most one non-zero entry.

Proof. This is a direct consequence of Lemma A.5 and the dual norm definition $|||z|||^* = \sup_{y \ne 0} \frac{z^T y}{|||y|||}$.

□


Lemma A.6. Let $p \in (0,1)$ and $k = \lceil (m-1)p + 1 \rceil$. For any $z \in \mathbb{R}^m$, we have

1. $|||z|||_p \ge |||z|||_k$.

2. $|||z|||_p^* \le |||z|||_k^*$.

Proof. Define the function $f$ with domain on $[1, m]$ as follows: let $f(1) = |||z|||_1 = \|z\|_1$; for $i \in [m-1]$ and $a \in (i, i+1]$, define

\[
f(a) = |||z|||_i + \big( |||z|||_{i+1} - |||z|||_i \big)(a - i).
\]

It is clear that $f$ is piecewise linear by construction. In addition, by Lemma A.10, $f$ is also convex. Notice that $|||z|||_p = E|||z|||_{Y+1} = E f(Y+1)$, where $Y$ is a random variable from the binomial distribution with parameters $m-1$ and $p$. By Jensen's inequality,

\[
E f(Y+1) \ge f(EY + 1) = f((m-1)p + 1).
\]

Thus by Lemma A.5,

\[
|||z|||_p \ge |||z|||_k
\]

for all $k \ge (m-1)p + 1$. So the first part follows.

To upper bound $|||z|||_p^*$, notice that if $k \ge (m-1)p + 1$,

\[
|||z|||_p^* = \sup_{w \ne 0} \frac{w^T z}{|||w|||_p} \le \sup_{w \ne 0} \frac{w^T z}{|||w|||_k} = |||z|||_k^*.
\]

□

For the following lemmas, the quantities $T_m(d,a)$ and $L_m(d,k)$ are defined as in Definition 3.1.
Lemma A.7. (Approximating $T_m(d,a)$) For $d \in [m]$ and $a \in (0, m]$:

\[
T_m(d,a) \le \sqrt{\frac{da}{m}}.
\]

Proof. For $k \in [m]$, by Jensen's inequality,

\[
E\sqrt{L_m(d,k)} \le \sqrt{E L_m(d,k)} = \sqrt{\frac{dk}{m}}.
\]

Note that the last equality follows from the expectation of a hypergeometric random variable. Now suppose $a \in (k-1, k]$. By the above inequality and applying Jensen's inequality one more time, we have

\[
\begin{aligned}
T_m(d,a) &= (k-a)\,E\sqrt{L_m(d,k-1)} + (1-(k-a))\,E\sqrt{L_m(d,k)} \\
&\le (k-a)\sqrt{\frac{d(k-1)}{m}} + (1-(k-a))\sqrt{\frac{dk}{m}} \\
&\le \sqrt{\frac{da}{m}}.
\end{aligned}
\]

□
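The bound of Lemma A.7 is easy to verify numerically. The sketch below assumes, as in the proof above, that $L_m(d,k)$ counts the ones in $k$ draws without replacement from $d$ ones and $m-d$ zeros, and that $T_m(d,a)$ linearly interpolates $E\sqrt{L_m(d,k)}$ between consecutive integers. The function names are illustrative.

```python
from math import ceil, comb, sqrt

def mean_sqrt_hypergeom(m, d, k):
    """E sqrt(L_m(d, k)) for L = number of ones in k draws without replacement
    from a pool of d ones and m - d zeros (exact sum over the support)."""
    lo, hi = max(0, k + d - m), min(k, d)
    return sum(sqrt(l) * comb(d, l) * comb(m - d, k - l)
               for l in range(lo, hi + 1)) / comb(m, k)

def t_interp(m, d, a):
    """T_m(d, a): linear interpolation of E sqrt(L_m(d, k)) for a in (k-1, k]."""
    k = ceil(a)
    return ((k - a) * mean_sqrt_hypergeom(m, d, k - 1)
            + (1 - (k - a)) * mean_sqrt_hypergeom(m, d, k))

m = 8
# Largest violation of T_m(d, a) <= sqrt(d * a / m) on a grid (should be <= 0).
worst_gap = max(t_interp(m, d, a) - sqrt(d * a / m)
                for d in range(1, m + 1)
                for a in [0.25 * i for i in range(1, 4 * m + 1)])
print(worst_gap)
```

Two special cases are worth remembering because they are used later: $T_m(1, s) = s/m$ and $T_m(m, k) = \sqrt{k}$.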





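Lemma A.8. (Lower bounds for $|||z|||_s^*$ and $|||z|||_p^*$) Let $z \in \mathbb{R}^m$. We have

1. For $s \in [m]$,

\[
|||z|||_s^* \ \ge\ \frac{s}{m}\max_{T\subset[m]} \frac{\|z[T]\|_1}{T_m(|T|, s)} \ \ge\ \max\Big( \|z\|_\infty,\ \sqrt{\frac{s}{m}}\max_{T\subset[m]} \frac{\|z[T]\|_1}{\sqrt{|T|}} \Big).
\]

2. For $p \in (0,1)$,

\[
\begin{aligned}
|||z|||_p^* \ &\ge\ p\max_{T\subset[m]} \Big( \sum_{k=0}^{m} \mathrm{pbinom}(k, m, p)\,T_m(|T|, k) \Big)^{-1} \|z[T]\|_1 \\
&\ge\ p\max_{T\subset[m]} \frac{\|z[T]\|_1}{T_m(|T|, pm)} \ \ge\ \max\Big( \|z\|_\infty,\ \sqrt{p}\max_{T\subset[m]} \frac{\|z[T]\|_1}{\sqrt{|T|}} \Big).
\end{aligned}
\]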

Proof. (1) Note that by definition,

\[
|||z|||_s^* = \sup_{w \ne 0} \frac{z^T w}{|||w|||_s}.
\]

Let $d \in [m]$ and $T \subset [m]$ such that $|T| = d$. Define $w \in \mathbb{R}^m$ such that $w[i] = 1$ for $i \in T$ and $w[i] = 0$ for $i \in T^c$. We have:

\[
\begin{aligned}
|||w|||_s &= \binom{m-1}{s-1}^{-1} \sum_{|S|=s} \|w[S]\|_2 \\
&= \binom{m-1}{s-1}^{-1} \sum_{l=\max(0,\,s+d-m)}^{\min(s,d)}\ \sum_{|S|=s,\,|S\cap T|=l} \|w[S]\|_2 \\
&= \binom{m-1}{s-1}^{-1} \sum_{l=\max(0,\,s+d-m)}^{\min(s,d)}\ \sum_{|S|=s,\,|S\cap T|=l} \sqrt{l} \\
&= \binom{m-1}{s-1}^{-1} \sum_{l=\max(0,\,s+d-m)}^{\min(s,d)} \binom{d}{l}\binom{m-d}{s-l}\sqrt{l} \\
&= \frac{m}{s}\,E\sqrt{L_m(d,s)} = \frac{m}{s}\,T_m(d,s).
\end{aligned}
\]

Thus for all $d \in [m]$ and any subset $T$ such that $|T| = d$, we have shown

\[
|||z|||_s^* \ge \frac{s}{m}\,\frac{\|z[T]\|_1}{T_m(d,s)}.
\]

Note that if $d = 1$, $E\sqrt{L_m(d,s)} = \frac{s}{m}$. Therefore

\[
\frac{s}{m}\max_{|T|=1} \frac{\|z[T]\|_1}{T_m(d,s)} = \|z\|_\infty.
\]

Moreover, by Lemma A.7,

\[
T_m(d,s) \le \sqrt{\frac{ds}{m}}.
\]

Hence we have

\[
\frac{s}{m}\,\frac{\|z[T]\|_1}{T_m(d,s)} \ge \sqrt{\frac{s}{m}}\,\frac{\|z[T]\|_1}{\sqrt{d}},
\]

and the first part of the claim follows.

(2) For the same $w \in \mathbb{R}^m$ defined previously,

\[
\begin{aligned}
|||w|||_p &= \sum_{k=0}^{m-1} \mathrm{pbinom}(k, m-1, p)\,|||w|||_{k+1} \\
&= m\sum_{k=0}^{m-1} \mathrm{pbinom}(k, m-1, p)\,\frac{T_m(d, k+1)}{k+1} \\
&= m\sum_{k=0}^{m-1} \binom{m-1}{k} p^k (1-p)^{m-k-1}\,\frac{1}{k+1}\,T_m(d, k+1) \\
&= \frac{1}{p}\sum_{k=0}^{m-1} \binom{m}{k+1} p^{k+1} (1-p)^{m-(k+1)}\,T_m(d, k+1) \\
&= \frac{1}{p}\sum_{k=0}^{m} \binom{m}{k} p^k (1-p)^{m-k}\,T_m(d, k).
\end{aligned}
\]

Thus for all $d \in [m]$ and any subset $T$ such that $|T| = d$, we have shown

\[
|||z|||_p^* \ge p\Big( \sum_{k=0}^{m} \mathrm{pbinom}(k, m, p)\,T_m(d, k) \Big)^{-1} \|z[T]\|_1.
\]

Next, we will show

\[
\sum_{k=0}^{m} \mathrm{pbinom}(k, m, p)\,T_m(d, k) \le T_m(d, pm).
\]

To this end, let us first notice that the LHS quantity is a binomial average of $T_m(d, k)$ with respect to $k$. By construction, $T_m(d, \cdot)$ is piecewise linear. Furthermore, $T_m(d, \cdot)$ is also concave by Lemma A.11. Now let $Y$ be a random variable having the binomial distribution with parameters $m$ and $p$. By Jensen's inequality,

\[
\sum_{k=0}^{m} \mathrm{pbinom}(k, m, p)\,T_m(d, k) = E\,T_m(d, Y) \le T_m(d, EY) = T_m(d, mp).
\]

In particular, if $d = 1$, it is easy to see that $T_m(d, mp) = p$. So

\[
p\max_{T\subset[m],\,|T|=1} \Big( \sum_{k=0}^{m} \mathrm{pbinom}(k, m, p)\,T_m(|T|, k) \Big)^{-1} \|z[T]\|_1 \ge \|z\|_\infty.
\]

On the other hand, by Lemma A.7,

\[
T_m(d, pm) \le \sqrt{\frac{d}{m}}\,\sqrt{pm} = \sqrt{pd}.
\]

Therefore

\[
p\Big( \sum_{k=0}^{m} \mathrm{pbinom}(k, m, p)\,T_m(d, k) \Big)^{-1} \|z[T]\|_1 \ge \sqrt{p}\,\frac{\|z[T]\|_1}{\sqrt{d}},
\]

and the proof is complete.

□

Lemma A.9. (Upper bounds for $|||z|||_s^*$ and $|||z|||_p^*$) Let $z \in \mathbb{R}^m$.

1. For $s \in [m]$,

\[
|||z|||_s^* \le \max_{|S|=s} \|z[S]\|_2.
\]

2. For $p \in (0,1)$,

\[
|||z|||_p^* \le \max_{|S|=k} \|z[S]\|_2,
\]

where $k = \lceil p(m-1) + 1 \rceil$.

Proof. To establish the upper bound, we will use the equivalent formulation of $|||\cdot|||_s^*$ in Corollary A.1. For $S \subset [m]$ of size $s$, as in Corollary A.1, let $E_S = I[S,\,]/\binom{m-1}{s-1}$, where $I \in \mathbb{R}^{m\times m}$ is the identity matrix. If we set $y_S = z[S]$, then $\sum_{|S|=s} E_S^T y_S = z$ and so $\{y_S\}$ is feasible. Therefore

\[
|||z|||_s^* \le \max_{|S|=s} \|z[S]\|_2.
\]

The upper bound of $|||z|||_p^*$ follows from the inequality $|||z|||_p^* \le |||z|||_k^*$ for $k = \lceil p(m-1) + 1 \rceil$ by the second part of Lemma A.6. □
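Together, part 1 of Lemma A.8 and part 1 of Lemma A.9 sandwich $|||z|||_s^*$ between two quantities that need no optimization. The sketch below computes both; it is an illustration under the definitions used above, with hypothetical function names. Since $T_m(|T|, s)$ depends on $T$ only through its size, the maximum over subsets of a given size $d$ is attained by the $d$ largest entries of $|z|$, which is why sorting suffices.

```python
from math import comb, sqrt

def mean_sqrt_hypergeom(m, d, k):
    """E sqrt(L_m(d, k)); at integer k this equals T_m(d, k)."""
    lo, hi = max(0, k + d - m), min(k, d)
    return sum(sqrt(l) * comb(d, l) * comb(m - d, k - l)
               for l in range(lo, hi + 1)) / comb(m, k)

def dual_norm_bounds(z, s):
    """Return (lower, upper) bounds on |||z|||_s^* from Lemmas A.8 and A.9."""
    m = len(z)
    mags = sorted((abs(v) for v in z), reverse=True)
    # Lower bound: (s/m) * max_T ||z[T]||_1 / T_m(|T|, s), best T = top-d magnitudes.
    lower = max((s / m) * sum(mags[:d]) / mean_sqrt_hypergeom(m, d, s)
                for d in range(1, m + 1))
    # Upper bound: largest Euclidean norm of an s-subvector.
    upper = sqrt(sum(v * v for v in mags[:s]))
    return lower, upper

print(dual_norm_bounds([3.0, -1.0, 2.0, 0.5], 2))
print(dual_norm_bounds([2.0] * 5, 3))         # bounds coincide at sqrt(3) * 2
print(dual_norm_bounds([-4.0, 0.0, 0.0], 2))  # bounds coincide at 4
```

The last two calls are the all-constant and 1-sparse cases, where the two bounds meet and therefore determine the dual norm exactly.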

Corollary A.4. (1-sparse vectors) Let $z = (z, 0, \ldots, 0)^T \in \mathbb{R}^m$. We have

\[
|||z|||_s^* = |||z|||_p^* = |z|.
\]

Proof. These are direct consequences of Lemma A.8 and Lemma A.9.

□


Corollary A.5. (All-constant vectors) Let $z \in \mathbb{R}^m$ be such that $z[i] = z$ for all $i \in [m]$. We have

1. $|||z|||_s^* = \sqrt{s}\,|z|$.

2. $|||z|||_p^* = mp\Big( \sum_{k=0}^{m} \mathrm{pbinom}(k, m, p)\sqrt{k} \Big)^{-1} |z|$.
Proof. First of all, note that $L_m(m, k) = k$ and $E\sqrt{L_m(m, k)} = \sqrt{k}$. Thus by Lemmas A.8 and A.9 we have

\[
|||z|||_s^* = \sqrt{s}\,|z|.
\]

So the first part of the claim is verified. Next, by Lemma A.8,

\[
|||z|||_p^* \ge mp\Big( \sum_{k=0}^{m} \mathrm{pbinom}(k, m, p)\sqrt{k} \Big)^{-1} |z|.
\]


On the other hand, for S such that \S\ = s, we can define 


y s = pbinom(fc, m,p)Vk ) (z ,..., z) T £ 


k =0 


For 


notation simplicity, let c = ^ ^ pbinom(&;, m,p)Vk^j. As in Corollary 


let E 5 = I[5, ]/(| 5 |_ 1 ) • For i £ [m], we have 


A.l 


for S C [mj, 


m—1 


m— 1 


^ pbinom(A:;m - l,p) ^ (Egy s )[i] = c 1 ^ pbinom(A:; m - l,p) 


fc =0 


|5|=fe+l 


fc =0 


aA~+T 


= c 1 -pbinom (k',m,p)Vk 

mp z —■' 

^ fc=o 


= z. 


Thus by Corollary |A.1[ 


|z|||* < max 11 ys’ 112 = mp( pbinom(L, m,p)Vk 


-1 


k =0 


and the proof is complete. 


□ 


Lemma A.10. (Convexity of $|||z|||_k$) Let $z \in \mathbb{R}^m$, where $m \ge 3$. For $k \in [m-2]$, we have the following inequality

\[
|||z|||_k + |||z|||_{k+2} \ge 2\,|||z|||_{k+1}.
\tag{27}
\]
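Inequality (27) says that the second differences of $k \mapsto |||z|||_k$ are non-negative. A brute-force numerical check on random vectors is sketched below; the helper names, the seed, and the dimensions are arbitrary illustrative choices, and the subset enumeration only scales to small $m$.

```python
import random
from itertools import combinations
from math import comb, sqrt

def norm_k(z, k):
    """|||z|||_k = C(m-1, k-1)^{-1} * sum over |S| = k of ||z[S]||_2."""
    m = len(z)
    return sum(sqrt(sum(z[i] ** 2 for i in S))
               for S in combinations(range(m), k)) / comb(m - 1, k - 1)

def second_differences(z):
    """|||z|||_k + |||z|||_{k+2} - 2|||z|||_{k+1} for k = 1, ..., m-2."""
    vals = [norm_k(z, k) for k in range(1, len(z) + 1)]
    return [vals[i] + vals[i + 2] - 2 * vals[i + 1] for i in range(len(z) - 2)]

random.seed(0)
worst = min(min(second_differences([random.gauss(0, 1) for _ in range(m)]))
            for m in range(3, 8) for _ in range(20))
print(worst)  # non-negative up to rounding error if (27) holds
```

For a vector with a single non-zero entry, $|||z|||_k$ is constant in $k$ and all second differences vanish, so (27) holds with equality there.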


Proof. We will first show that the claim is true for $k = m-2$. Notice that in this case $|||z|||_{k+2} = |||z|||_m = \|z\|_2$. If $\|z\|_2 = 0$, the inequality (27) is trivially true. Now suppose $\|z\|_2 > 0$; dividing both sides of the inequality by $\|z\|_2$, we have

\[
\binom{m-1}{m-3}^{-1} \sum_{|S|=m-2} \frac{\|z[S]\|_2}{\|z\|_2} + 1 \ \ge\ 2\binom{m-1}{m-2}^{-1} \sum_{|S|=m-1} \frac{\|z[S]\|_2}{\|z\|_2}.
\]

Now let $x = (x_1, \ldots, x_m)^T \in \mathbb{R}^m$ be such that $x_i = z[i]^2/\|z\|_2^2$. It suffices to show

\[
\sum_{|S|=m-2} \Big( \sum_{i\in S} x_i \Big)^{1/2} + \frac{(m-1)(m-2)}{2} \ \ge\ (m-2)\sum_{i=1}^{m} \sqrt{1 - x_i},
\tag{28}
\]

for all $x \ge 0$ entry-wise such that $\sum_i x_i = 1$. We will now prove the above inequality by induction on $m$. First of all, notice that for the base case where $m = 3$, we need to show:

\[
\sqrt{x_1} + \sqrt{x_2} + \sqrt{x_3} + 1 \ \ge\ \sqrt{1 - x_1} + \sqrt{1 - x_2} + \sqrt{1 - x_3},
\]

with the constraints $x_i \ge 0$ and $x_1 + x_2 + x_3 = 1$. For fixed $x_3$, let

\[
f(x_1) = \sqrt{x_1} + \sqrt{1 - x_1 - x_3} + \sqrt{x_3} + 1 - \sqrt{x_1 + x_3} - \sqrt{1 - x_1} - \sqrt{1 - x_3}.
\]

We will show that $f(x_1)$ is minimized at $x_1 = 0$ or $x_1 = 1 - x_3$. Suppose now $x_1 > 0$. Taking the derivative with respect to $x_1$:

\[
f'(x_1) = \frac{1}{2}\Big( \frac{1}{\sqrt{x_1}} - \frac{1}{\sqrt{1 - x_1 - x_3}} - \frac{1}{\sqrt{x_1 + x_3}} + \frac{1}{\sqrt{1 - x_1}} \Big).
\]

Let $l(x_1) = \frac{1}{\sqrt{x_1}} - \frac{1}{\sqrt{x_1 + x_3}}$. Note that $f'(x_1) = \frac{1}{2}l(x_1) - \frac{1}{2}l(1 - x_3 - x_1)$. Now we have

\[
l'(x_1) = \frac{1}{2}(x_1 + x_3)^{-3/2} - \frac{1}{2}x_1^{-3/2} \le 0.
\]

So $l(x_1)$ is decreasing on $(0, 1 - x_3)$ and by symmetry the function $l(1 - x_3 - x_1)$ is increasing on $(0, 1 - x_3)$. Since the two functions coincide at the midpoint $x_1 = \frac{1 - x_3}{2}$, we know that $f'(x_1) \ge 0$ on $(0, \frac{1 - x_3}{2})$ and $f'(x_1) \le 0$ on $(\frac{1 - x_3}{2}, 1 - x_3)$. Thus, the minimum of $f$ can only be attained at the boundaries, i.e. $x_1 = 0$ or $x_1 = 1 - x_3$. In either case we have

\[
\sqrt{x_1} + \sqrt{x_2} + \sqrt{x_3} + 1 - \sqrt{1 - x_1} - \sqrt{1 - x_2} - \sqrt{1 - x_3}
\ \ge\ \sqrt{x_2} + \sqrt{x_3} - \sqrt{1 - x_2} - \sqrt{1 - x_3} = 0,
\]

as $x_2 + x_3 = 1$. So we establish (28) for $m = 3$.

Suppose (28) is also true for rn = n — 1. For rn = n, similar to the m = 3 case, for fixed 
X3, x n , define 

/m= E (E^) 1/2 + (,1 ~ 1) 2 ( "~ 2) -(»- 2) Ev^ 

|S|=n—2 ' i£S ' 


~ Xi 


i= 1 


subject to Xi > 0 and Yi x i = 1- Again, we will show / attains its minimum at either xi = 0 or 
xi = 1 — Yi =3 x i- Notice that 


E (e*/) 1/2 = e (x 1+ e^) i,z + E o*+*» + e 


1/2 


|S|=n —2 jes 


\S\=n-3,l,2£S j£S 


\S\=n—4,1,2gS 


j£S 


1/2 


1/2 


*»+E*d + (E 

3 =3 

1/2 


+ E 

|5|=n-3,1,205 jeS 

n n 

x i ~ xi 

i =3 1=3 

n n 

+5^(1 - xi - xi ) i/2 + f y^x 


1/2 


i=3 


+ 2 (i-Zi-Zj ) 172 

3<i<j<n 
1/2 
3 


3 =3 
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In addition 


TL 71 1 /9 71 

5^(1 - Xi) l/2 = (1 - Xi) 1/2 + (xi + ^2 X i) + “ x i) 1/2 - 

3=3 


i=l 


i =3 


Taking derivative with respect to xi, 


^ / n n 

/'(*i) = o (Y ( xi+ Y x i 


- Xi 


- 1/2 


_ ^-1/2 _ ( n _ 


+ (n- 2)(1 - xi) 

Now let 

n n 

K x i) = Y{ xi + Y 

»=3 3=3 

So 2/'(xi) = Z(xi) - Z(1 - EILg Xj - xi). Again 


E(1 - Xl ~ x '*) 1/2 

,/ v- N" 1 ^ 
2 )^X 1 + } j Xj 

i =3 


- 1/2 


— n — 


2) ( x i + Y1 x i 

3=3 


- 1/2 


1 n n 

l '( x i) = -2Y( Xl + Y- 

i=3 j=3 


3!/ o 


-3/2 n - 2 
+ —w- 


X 1 + H : 


3=3 


-3/2 


It is easy to see that l'(x\) < 0 and so Z(xi) is decreasing on (0,1 — Y17=3 x ’* — x i)- On the other 
hand lim a . 1 j r o+ Z(xi) = +oo. By symmetry f'{x i) > 0 on ( 0 , 5(1 — ^” =3 Xi ~ xi)) an d < 0 on 
( 0 . 5 ( 1 -HU Xi — x 1 )). So / attains its minimum at x\ = 0 or x\ = 1 — ^" =3 x». Hence we have 


E (E *<) 1/2 + ( " ~ 'f ~ 2> - (" - 2 ) Ed - *i) m 

|5|=n-2 ieS i=l 


> 


E + E XE= 

|5|=n-3,105 |5|=n-2,105 j&S 


1/2 (n — 2)(n — 3) 


+ 


-(n-2)^(l-Xi) 1 / 2 . (29) 


i=2 


By the induction assumption that (28) holds when m = n — 1, we have 


E ( E V) V2 + (n ~ 2) 2 ( "~ 3> > In - 3) Ed - 

|5|=n-3,105 3 'eS 


i=2 


Thus (|29j) is greater than or equal to 
(n — 2)(n — 3) 


E (E= 

|5|=n—2,105 3 ’eS 


+ E (E* 

|5|=n—2,105 jeS 
1/2 


1/2 (n — l)(n — 2 ) 

3 1 + 


(n-2)-^(!-Xi) 1/2 


i=2 


Ed - *-> 1/2 = Ed - *o 1/2 - Ed - x ( ) 1/2 = 0 . 


1=2 


i =2 


i=2 


Thus we have verified the claim that (28) and hence (27) holds for $k = m-2$ for all $m \ge 3$. To establish the case for general $1 \le k < m-2$, we again perform induction on the $(m, k)$-tuple. Note that the base case $m = 3$ and $k = 1$ has been previously proved. Suppose (27) holds for $m = n-1$ and $1 \le k \le n-3$. Now consider $m = n$ and $1 \le k < n-2$. Notice that

\[
\begin{aligned}
|||z|||_k &= \frac{1}{n-k}\binom{n-1}{k-1}^{-1} \sum_{|T|=n-1}\ \sum_{|S|=k,\,S\subset T} \|z[S]\|_2 \\
&= \frac{1}{n-1}\binom{n-2}{k-1}^{-1} \sum_{|T|=n-1}\ \sum_{|S|=k,\,S\subset T} \|z[S]\|_2 \\
&= \frac{1}{n-1} \sum_{|T|=n-1} |||z[T]|||_k.
\end{aligned}
\]

By the induction assumption, for all $T$ such that $|T| = n-1$, we have:

\[
|||z[T]|||_k + |||z[T]|||_{k+2} \ge 2\,|||z[T]|||_{k+1}.
\]

Therefore

\[
|||z|||_k + |||z|||_{k+2} - 2\,|||z|||_{k+1} = \frac{1}{n-1} \sum_{|T|=n-1} \big( |||z[T]|||_k + |||z[T]|||_{k+2} - 2\,|||z[T]|||_{k+1} \big) \ge 0.
\]

Thus the claim also holds for $m = n$ and $1 \le k < n-2$, completing the proof.

□


A.6 Miscellaneous 

Lemma A.11. (Concavity of $E\sqrt{L_m(d,k)}$) Let $d \in [m]$. For $k \in [m-2]$, we have

\[
E\sqrt{L_m(d,k)} + E\sqrt{L_m(d,k+2)} \le 2\,E\sqrt{L_m(d,k+1)},
\tag{30}
\]

where the hypergeometric random variable $L_m(d,k)$ is defined as in Definition 3.1.


Proof. Suppose we are now sampling without replacement from a pool of numbers with $d$ 1's and $m-d$ 0's. For $i \in [m]$, denote by $X_i \in \{0,1\}$ the $i$-th outcome. It is easy to see that $L_m(d,k)$ and $\sum_{i=1}^{k} X_i$ have the same distribution. To show (30), it suffices to prove the following conditional expectation inequality:

\[
\sqrt{L_m(d,k)} + E\big[\sqrt{L_m(d,k+2)} \,\big|\, L_m(d,k)\big] \le 2\,E\big[\sqrt{L_m(d,k+1)} \,\big|\, L_m(d,k)\big].
\]

Note that the above inequality follows if for all $0 \le a \le \min(d,k)$:

\[
\sqrt{a} + E\sqrt{a + X_{k+1} + X_{k+2}} \le 2\,E\sqrt{a + X_{k+1}}.
\]

It is easy to see that

\[
E\sqrt{a + X_{k+1}} = \frac{d-a}{m-k}\sqrt{a+1} + \Big(1 - \frac{d-a}{m-k}\Big)\sqrt{a},
\]

\[
\begin{aligned}
E\sqrt{a + X_{k+1} + X_{k+2}} &= \frac{d-a}{m-k} \times \frac{d-a-1}{m-k-1}\sqrt{a+2}
+ 2 \times \frac{d-a}{m-k} \times \frac{m-k-(d-a)}{m-k-1}\sqrt{a+1} \\
&\quad + \frac{m-k-(d-a)}{m-k} \times \frac{m-k-(d-a)-1}{m-k-1}\sqrt{a}.
\end{aligned}
\]

By elementary algebra, it can be shown that

\[
2\,E\sqrt{a + X_{k+1}} - \sqrt{a} - E\sqrt{a + X_{k+1} + X_{k+2}}
= \frac{d-a}{m-k} \times \frac{d-a-1}{m-k-1} \times \big( 2\sqrt{a+1} - \sqrt{a+2} - \sqrt{a} \big) \ge 0.
\]

The inequality follows since $f(x) = \sqrt{x}$ is a concave function. Thus the proof is complete.

□
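The concavity statement (30) can be confirmed by exact enumeration of the hypergeometric distribution for small pools. The following is a minimal sketch with illustrative function names; it checks every admissible $(m, d, k)$ with $m \le 12$.

```python
from math import comb, sqrt

def mean_sqrt_hypergeom(m, d, k):
    """E sqrt(L_m(d, k)): L counts the ones among k draws without replacement
    from a pool with d ones and m - d zeros."""
    lo, hi = max(0, k + d - m), min(k, d)
    return sum(sqrt(l) * comb(d, l) * comb(m - d, k - l)
               for l in range(lo, hi + 1)) / comb(m, k)

def concavity_gap(m, d, k):
    """2 E sqrt(L(k+1)) - E sqrt(L(k)) - E sqrt(L(k+2)); (30) says this is >= 0."""
    return (2 * mean_sqrt_hypergeom(m, d, k + 1)
            - mean_sqrt_hypergeom(m, d, k)
            - mean_sqrt_hypergeom(m, d, k + 2))

gaps = [concavity_gap(m, d, k)
        for m in range(3, 13) for d in range(1, m + 1) for k in range(1, m - 1)]
print(min(gaps))
```

When $d = 1$ the gap is exactly zero, since $E\sqrt{L_m(1,k)} = k/m$ is linear in $k$; this matches the factor $(d-a)(d-a-1)$ in the identity at the end of the proof.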


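Lemma A.12. Let $x(t) = (x_1(t), \ldots, x_m(t))^T \in \mathbb{R}^m$ be an $m$-dimensional function on $[0, \epsilon)$ such that: (1) $x_1(0) = 1$ and for all $i \ge 2$, $x_i(0) = 0$; (2) the derivative $\dot x_i(t)$ exists and is bounded for all $t \in (0, \epsilon)$. We have

\[
\lim_{t\downarrow 0+} \frac{\|x(t)\|_2 - \|x(0)\|_2}{t} = \lim_{t\downarrow 0+} \dot x_1(t).
\]

Proof.

\[
\begin{aligned}
\lim_{t\downarrow 0+} \frac{\|x(t)\|_2 - \|x(0)\|_2}{t}
&= \lim_{t\downarrow 0+} \frac{\big(\sum_{i=1}^{m} x_i^2(t)\big)^{1/2} - 1}{t} \\
&= \lim_{t\downarrow 0+} \frac{\sum_{i=1}^{m} x_i^2(t) - 1}{t}\,\Big( \big(\textstyle\sum_{i=1}^{m} x_i^2(t)\big)^{1/2} + 1 \Big)^{-1} \\
&= \frac{1}{2}\lim_{t\downarrow 0+} \frac{\sum_{i=1}^{m} x_i^2(t) - 1}{t} \\
&= \frac{1}{2}\lim_{t\downarrow 0+} \frac{x_1^2(t) - 1}{t} + \frac{1}{2}\sum_{i=2}^{m} \lim_{t\downarrow 0+} \frac{x_i^2(t)}{t} \\
&= \lim_{t\downarrow 0+} \frac{x_1(t) - 1}{t} + \frac{1}{2}\sum_{i=2}^{m} \lim_{t\downarrow 0+} \frac{x_i^2(t)}{t}.
\end{aligned}
\]

By the mean value theorem, for each $t \in (0, \epsilon)$, there exists $s_t \in (0, t)$ such that $x_1(t) - 1 = \dot x_1(s_t)\,t$. Thus the first term simply becomes $\lim_{t\downarrow 0+} \dot x_1(t)$. By the same argument, for each $i \in \{2, \ldots, m\}$, $x_i(t) = \dot x_i(s_t)\,t$ for some $s_t \in (0, t)$. Since $\dot x_i(t)$ is bounded, we have

\[
\lim_{t\downarrow 0+} \frac{x_i^2(t)}{t} = \lim_{t\downarrow 0+} \dot x_i(s_t)^2\, t = 0.
\]

Therefore the claim is verified. □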
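A finite-difference experiment illustrates Lemma A.12: only the first coordinate contributes to the one-sided derivative of the norm at $t = 0$. The curve below is an arbitrary example chosen here so that $x_1(0) = 1$, the other coordinates vanish at $0$, and $\lim_{t\downarrow 0+}\dot x_1(t) = 2$; it is not taken from the paper.

```python
import math

def x(t):
    """Example curve with x(0) = (1, 0, 0) and x_1'(0+) = 2."""
    return (math.cos(t) + 2.0 * t, math.sin(t), t * t)

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def one_sided_slope(t):
    """Difference quotient (||x(t)||_2 - ||x(0)||_2) / t for t > 0."""
    return (norm(x(t)) - norm(x(0.0))) / t

slopes = [one_sided_slope(10.0 ** -k) for k in range(1, 7)]
print(slopes)  # approaches 2 as t decreases
```

Note that the second coordinate has derivative $1$ at $0$, yet it does not affect the limit: its contribution to the squared norm is of order $t^2$, which is exactly the mechanism used in the proof.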

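Lemma A.13. Let $x(t) = (x_1(t), \ldots, x_m(t))^T \in \mathbb{R}^m$ be an $m$-dimensional function on $[0, \epsilon)$ such that: (1) $x_i(0) = 0$ for all $i = 1, \ldots, m$; (2) the derivative $\dot x_i(t)$ exists for all $t \in (0, \epsilon)$. We have

\[
\lim_{t\downarrow 0+} \frac{\|x(t)\|_2}{t} = \Big\| \lim_{t\downarrow 0+} \dot x(t) \Big\|_2.
\]

Proof.

\[
\lim_{t\downarrow 0+} \frac{\|x(t)\|_2}{t}
= \lim_{t\downarrow 0+} \Big( \sum_{i=1}^{m} \frac{x_i^2(t)}{t^2} \Big)^{1/2}
= \Big( \sum_{i=1}^{m} \Big( \lim_{t\downarrow 0+} \frac{x_i(t)}{t} \Big)^2 \Big)^{1/2}
= \Big\| \lim_{t\downarrow 0+} \dot x(t) \Big\|_2.
\]

□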


Lemma A.14. Let $a = (a_1, \ldots, a_m)^T \in \mathbb{R}^m$ where $a_1 \ne 0$ and $x(t) = (x_1(t), \ldots, x_m(t))^T \in \mathbb{R}^m$ be an $m$-dimensional function on $[0, \epsilon)$ such that: (1) $x_1(0) = 1$ and for all $i \ge 2$, $x_i(0) = 0$; (2) the derivative $\dot x_i(t)$ exists and is bounded for all $t \in (0, \epsilon)$. We have

\[
\lim_{t\downarrow 0+} \frac{|a^T x(t)| - |a_1|}{t} = |a_1| \lim_{t\downarrow 0+} \dot x_1(t) + \mathrm{sgn}(a_1) \sum_{i=2}^{m} a_i \lim_{t\downarrow 0+} \dot x_i(t).
\]

Proof. Without loss of generality, assume $a_1 > 0$. Since $x_1(0) = 1$ and for all $i \ge 2$, $x_i(0) = 0$, by continuity, for sufficiently small $t$, we have

\[
\frac{|a^T x(t)| - |a_1|}{t} = \frac{\big| a_1 x_1(t) + \sum_{i=2}^{m} a_i x_i(t) \big| - a_1}{t} = \frac{a_1 x_1(t) - a_1 + \sum_{i=2}^{m} a_i x_i(t)}{t}.
\]

Therefore, by the same argument as in the proof of Lemma A.12,

\[
\begin{aligned}
\lim_{t\downarrow 0+} \frac{|a^T x(t)| - |a_1|}{t}
&= \lim_{t\downarrow 0+} \frac{a_1 x_1(t) - a_1}{t} + \lim_{t\downarrow 0+} \sum_{i=2}^{m} \frac{a_i x_i(t)}{t} \\
&= a_1 \lim_{t\downarrow 0+} \dot x_1(t) + \sum_{i=2}^{m} a_i \lim_{t\downarrow 0+} \dot x_i(t).
\end{aligned}
\]

□


Lemma A.15. Let $a = (a_1, \ldots, a_m)^T \in \mathbb{R}^m$ and $x(t) = (x_1(t), \ldots, x_m(t))^T \in \mathbb{R}^m$ be an $m$-dimensional function on $[0, \epsilon)$ such that: (1) $x_i(0) = 0$ for all $i = 1, \ldots, m$; (2) the derivative $\dot x_i(t)$ exists for all $t \in (0, \epsilon)$. We have

\[
\lim_{t\downarrow 0+} \frac{|a^T x(t)|}{t} = \Big| \sum_{i=1}^{m} a_i \lim_{t\downarrow 0+} \dot x_i(t) \Big|.
\]

Proof. The proof is similar to that of Lemma A.13.

□
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